DELE CA2 PART B : PENDULUM REINFORCEMENT LEARNING¶
Team Members : Dario Prawara Teh Wei Rong (2201858) | Lim Zhen Yang (2214506)
BACKGROUND RESEARCH & ANALYSIS¶
Reinforcement Learning (RL) is a type of machine learning that focuses on training agents to make decisions in an environment by maximizing a reward signal. The roots of RL stretch back to the 1930s and 40s, when Skinner presented his experimental research on the behaviour of animals. He described the concept of "operant conditioning", which involved manipulating the consequences of an animal's behaviour in order to change the likelihood that the behaviour would recur in the future. (Skinner, 1991)
For example, one of his most famous experiments was the "Skinner Box" experiment, which studied operant conditioning. In this experiment, Skinner placed a rat in a box with a lever and a food dispenser, and demonstrated how the rat learned to press the lever to receive a food reward. This experiment helped Skinner develop his theory of operant conditioning, which states that behavior is shaped by consequences (rewards and punishments) that follow the behavior.
HOW REINFORCEMENT LEARNING WORKS¶
Any goal can be formalized as the outcome of maximizing a cumulative reward - Hado van Hasselt, DeepMind.com
Each of these algorithms revolves around an agent that acts in an environment. There are a few different types of components that an agent can contain. These are:
- Agent State
- Policy
- Value Function Estimate - Optional
- Model - Optional

Image Credits: deepmind.com
KEY CHALLENGES IN REINFORCEMENT LEARNING¶
Exploration-Exploitation¶
When an agent is initialized and placed into a new environment, its actions are essentially random: the agent possesses no knowledge of what to do, or even what the task to tackle is. Only by interacting with the environment, gaining knowledge from data and learning the optimal actions does it improve. However, this "reliance on data" can lead to two different scenarios. (Wang, Zariphopoulou and Zhou, 2019)
Exploitation: The agent learns that a certain action returns some reward. Because the goal is to maximize the total reward, the agent then continues to exploit this specific knowledge by repeatedly performing this move. As one can imagine, if the agent has not visited a large enough portion of the action space, this knowledge may lead to a suboptimal policy (Wiering, 1999).
Exploration: The agent takes actions that do not currently have the maximum expected reward, in order to learn more about the environment and discover better options for the future. However, an agent that focuses solely on learning new knowledge wastes resources, time and opportunities.
Thus, the agent must learn to balance the trade-off between exploration and exploitation, in order to learn the actions that ultimately lead to the optimal policy.
What are some approaches to tackle this issue? The simplest is to choose randomly: on every move there is a 50% chance to explore, and a 50% chance to exploit. One may then realize that, in fact, a much smarter approach is to introduce a parameter epsilon $\epsilon$ that controls the probability of exploring, with the probability of exploiting being 1 - $\epsilon$. By doing this, $\epsilon$ can be tuned to improve the learned policy, which empirically works much better. (Bather, 1990)
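The $\epsilon$-greedy rule described above can be sketched in a few lines. This is a minimal illustration (not part of the notebook's code), following the common convention where $\epsilon$ is the probability of exploring:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon (pick a random action),
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

With $\epsilon = 0$ the choice is always greedy; in practice $\epsilon$ is often annealed from a high value toward a small one as the agent gains knowledge.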
Delayed Reward¶
Usually, unlike in Supervised Learning, agents do not get immediate feedback on a per-action basis. Rather, the reward is attributed to a sequence of actions. This means that agents must account for the possibility that greedy behaviour (chasing immediate rewards) may result in less future reward.
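To make this trade-off concrete, here is a small sketch (illustrative only, not part of the notebook) of the discounted return $G = \sum_t \gamma^t r_t$ that RL agents maximize: a patient reward sequence can beat a greedy one even after future rewards are discounted.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative reward with each future step discounted by gamma."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

greedy  = [1, 0, 0, 0]   # grab a small reward now, nothing later
patient = [0, 0, 0, 10]  # forgo the immediate reward for a larger delayed one
```

Here `discounted_return(patient)` (10 · 0.9³ = 7.29) exceeds `discounted_return(greedy)` (1.0), so the patient sequence is preferred despite the delay.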
APPLICATIONS & USES OF REINFORCEMENT LEARNING¶
What are the uses of RL?¶
RL can be used to optimize decision making in systems where the decision maker lacks complete information about the system or the consequences of its actions. It can also control systems that are difficult to model fully with mathematical equations, such as robots that must operate in uncertain environments, and is widely applied in robotics, games and autonomous systems.
For example, Boston Dynamics has used reinforcement learning to train its robots to balance and walk on rough terrain, such as rocks or uneven surfaces. The robots receive rewards for maintaining balance and penalties for falling over, allowing them to learn to walk more stably and efficiently over time.

Boston Dynamics Robot (Image Credits: bostondynamics.com)
RL has proven to be a powerful tool for Boston Dynamics in their development of advanced robots, allowing them to perform complex and dynamic tasks in real-world environments with greater stability and robustness. (Pineda-Villavicencio, Ugon and Yost, 2018)
OUR PROJECT OBJECTIVE¶
Before we begin, let us take a look at our project's objective.
Using OpenAI Gym, apply a suitable modification of deep Q-network (DQN) architecture to the problem. The model must exert some appropriate torque on the pendulum to balance it.
BACKGROUND INFORMATION¶
Pendulum is one of the five classic control environments. These environments are stochastic only in their initial state, which is drawn within a given range.
The inverted pendulum swingup problem is based on the classic problem in control theory. The system consists of a pendulum attached at one end to a fixed point, and the other end being free. The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point.
Action Space - The pendulum can only perform one action (torque).
- An ndarray with shape (1,) representing the torque applied to the free end of the pendulum, with values ranging from -2.0 to 2.0.
Observation Space - There are a total of 3 distinct components in the observation space.
- Coordinates of the Pendulum in x = cos(theta)
- Coordinates of the Pendulum in y = sin(theta)
- Angular Velocity of the Pendulum
Rewards Granted - For each time step, the reward :
- decreases as the pendulum deviates further from the upright position (θ = 0).
- decreases as the pendulum's angular velocity increases (faster movement).
- decreases as larger torques are applied to the pendulum.
The minimum possible per-step reward is -16.2736044 (pendulum hanging straight down at maximum angular velocity with maximum torque), while the maximum is 0 (pendulum perfectly upright and stationary, with no torque applied). The goal is therefore to keep the cumulative reward as close to 0 as possible.
The pendulum starts at a random angle in [-pi, pi] and a random angular velocity in [-1, 1] and the episode truncates at 200 time steps.
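These three penalty terms correspond to the per-step reward given in the Pendulum-v1 documentation, $r = -(\theta^2 + 0.1\,\dot{\theta}^2 + 0.001\,\tau^2)$, with $\theta$ normalized to $[-\pi, \pi]$. A quick sketch (not part of the notebook's code) confirms that the worst single-step reward is exactly the -16.2736044 quoted above:

```python
import numpy as np

def pendulum_reward(theta, theta_dot, torque):
    """Per-step reward from the Pendulum-v1 docs:
    r = -(theta^2 + 0.1*theta_dot^2 + 0.001*torque^2),
    with theta wrapped into [-pi, pi] (0 = upright)."""
    theta = ((theta + np.pi) % (2 * np.pi)) - np.pi
    return -(theta**2 + 0.1 * theta_dot**2 + 0.001 * torque**2)

# Worst case: hanging down (theta = pi) at max speed (8) and max torque (2)
worst = pendulum_reward(np.pi, 8.0, 2.0)   # ≈ -16.2736044
best = pendulum_reward(0.0, 0.0, 0.0)      # 0.0
```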
INITIALIZING MODULES AND LIBRARIES¶
Import necessary libraries for pre-processing, data exploration, feature engineering and model evaluation.
Some libraries used include pytorch, numpy, matplotlib, and gym.
# Import the necessary modules and libraries
# Gym and Environment Handling
import gym
# Numerical and Visualization Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import animation, rc
import seaborn as sns
from torchinfo import summary
# Display and Visualization
from IPython import display as ipythondisplay
from pyvirtualdisplay.display import Display
from IPython.display import clear_output, display
# PyTorch for Neural Networks and Optimization
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributions as distributions
from torch.distributions import Normal
# Utility and Miscellaneous
import os
import random
import copy
import datetime
from collections import deque, namedtuple
# Hyperparameter tuning
from ray import tune, train
from ray.train import Checkpoint, session
from ray.tune.schedulers import ASHAScheduler
from functools import partial
import tempfile
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
PERFORM CHECK FOR GPU¶
- Ensure the GPU can be found using torch.cuda.is_available().
- If it returns True, the GPU is available to PyTorch as expected.
torch.cuda.is_available()
True
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
cuda:0
SETTING UTILITY FUNCTIONS AND CLASSES¶
- Before beginning our analysis, we will first define some utility functions and classes that will be used later for the analysis and evaluation of our models.
- One such utility is plot_agent_performance, which plots charts visualizing changes in the reward obtained.
# Function to plot the performance of the model over time
def plot_agent_performance(scores, average_reward, model_name="Random Agent"):
    """
    Plots the performance of an agent.

    Parameters:
        scores (list): A list of scores representing the agent's performance in each episode.
        average_reward (float): The average reward across all episodes.
        model_name (str): The name of the model/agent.
    """
    # Creating subplots: 1 row, 2 columns
    plt.figure(figsize=(15, 6))

    # First subplot: Reward over Episodes
    plt.subplot(1, 2, 1)
    plt.plot(scores, label='Reward per Episode')
    plt.axhline(y=average_reward, color='r', linestyle='-', label='Average Reward')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title(f'Reward over Episodes for {model_name}')
    plt.legend()

    # Second subplot: Histogram of Rewards
    plt.subplot(1, 2, 2)
    plt.hist(scores, bins=20, alpha=0.7)
    plt.axvline(x=average_reward, color='r', linestyle='-', label='Average Reward')
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    plt.title(f'Distribution of Rewards for {model_name}')
    plt.legend()

    # Display the subplots
    plt.tight_layout()
    plt.show()
# Creating an animation function
def create_animation(frames, filename=None):
    rc("animation", html="jshtml")
    fig = plt.figure()
    plt.axis("off")
    im = plt.imshow(frames[0], animated=True)

    def updatefig(i):
        im.set_array(frames[i])
        return im,

    animationFig = animation.FuncAnimation(fig, updatefig, frames=len(frames), interval=len(frames) / 10, blit=True, repeat=False)
    ipythondisplay.display(ipythondisplay.HTML(animationFig.to_html5_video()))
    if filename is not None:
        animationFig.save(filename, writer='imagemagick')
    return animationFig
# Function to test agent weights
def test_agent(agent, type):
    env = gym.make('Pendulum-v1', g=9.81)
    frames = []
    state = env.reset()
    done = False
    cumulative_reward = 0  # Initialize cumulative reward
    while not done:
        if type == 'SAC':
            action, _ = agent.choose_action(torch.FloatTensor(state))
        else:
            action = agent.choose_action(torch.FloatTensor(state))
        state_prime, reward, done, _ = env.step([action])
        cumulative_reward += reward  # Accumulate reward
        state = state_prime
        screen = env.render(mode='rgb_array')
        frames.append(screen)
    env.close()
    print(f'Test reward: {cumulative_reward}')  # Print cumulative reward
    create_animation(frames)
# Initialize the RunningCalc class
class RunningCalc:
    class Node:
        def __init__(self, val):
            self.val = val
            self.next = None

    def __init__(self, limit=10):
        self.head = None
        self.tail = None
        self.count = 0
        self.limit = limit
        self.total = 0

    def add(self, val):
        self.count += 1
        if self.count > self.limit:
            self.total -= self.head.val
            self.head = self.head.next
            self.count -= 1
        if self.head is None and self.tail is None:
            self.head = self.Node(val)
            self.tail = self.head
        else:
            newNode = self.Node(val)
            self.tail.next = newNode
            self.tail = newNode
        self.total += val

    def calc(self):
        return self.total
# Initialize the Tracker class to track rewards over time
class Tracker:
    def __init__(self):
        self.running = {}
        self.reward = {}
        self.success = {}
        self.name = None

    def add(self, name, running, reward, success_rate):
        if name in self.running:
            self.running[name].append(running)
        else:
            self.running[name] = [running]
        if name in self.reward:
            self.reward[name].append(reward)
        else:
            self.reward[name] = [reward]
        if name in self.success:
            self.success[name].append(success_rate)
        else:
            self.success[name] = [success_rate]  # Fixed: previously appended [reward] here by mistake
        print(f"{name} | Running 200 Reward: {running} | Reward: {reward} | Running Success Rate: {success_rate} ")

    def plot(self, name, metric):
        fig = plt.figure()
        fig.suptitle(f"{name} | {metric}")
        ax = fig.subplots()
        if metric == 'success':
            ax.plot(self.success[name])
        else:
            ax.plot([200 for i in range(len(self.reward[name]))], label='Solve', linestyle='--')
            ax.plot(self.reward[name], label='Reward', color=sns.color_palette('pastel')[0])
            ax.plot(self.running[name], label='Running', color=sns.color_palette('pastel')[1], linestyle='--')
        plt.legend()

    def plot_all(self, metric):
        fig = plt.figure()
        ax = fig.subplots()
        ax.set_xlabel("Episodes (in 20s)")
        if metric == 'success':
            fig.suptitle("All Success")
            for i, name in enumerate(sorted(self.success.keys())):
                ax.plot(self.success[name], label=f'{name}', color=sns.color_palette('Paired')[1 + i * 2])
            ax.set_ylabel("Success Rate")
            plt.legend()
        elif metric == 'reward':
            fig.suptitle("All Rewards")
            first = list(self.reward.keys())[0]
            ax.plot([200 for i in range(len(self.reward[first]))], label='Solve', linestyle='--')
            for i, name in enumerate(sorted(self.reward.keys())):
                ax.plot(self.running[name], label=f'{name}', color=sns.color_palette('Paired')[1 + i * 2])
                ax.plot(self.reward[name], color=sns.color_palette('Paired')[0 + i * 2], linestyle='--')
            ax.set_ylabel("Episode Reward")
            plt.legend()
SETTING CHART CUSTOMIZATIONS FOR EDA¶
- Before loading the pendulum environment from OpenAI Gym, we will set chart customizations in Seaborn to ensure a consistent and uniform layout for the charts in this notebook.
# Change theme of charts
sns.set_theme(style='darkgrid')
# Change font of charts
sns.set(font='Century Gothic')
# Variable for color palettes
color_palette = sns.color_palette('muted')
LOADING THE PENDULUM ENVIRONMENT¶
- We will be using the OpenAI Gym environment under Classic Control to make the Pendulum-v1 environment.
- To load the environment / animation, we will make use of Matplotlib's animation function and ipythondisplay.
To visualize what the animation looks like, we will display the environment by running 200 time steps for the pendulum created with gym.make("Pendulum-v1").
# Setting up the environment
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 200 time steps
frames = []
for i in range(200):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
INFORMATION ON THE PENDULUM COORDINATE SYSTEM¶
To implement the pendulum's dynamic equations, we will be utilizing the pendulum's coordinate system as shown below :
- x-y: cartesian coordinates of the pendulum's end in meters.
- theta: angle in radians.
- tau: torque in Nm, defined as positive counter-clockwise.
EXPLORATORY DATA ANALYSIS¶
First, we will conduct some simple exploratory data analysis (EDA) of the pendulum environment, allowing us to better understand the different actions and how they affect the pendulum's movement. Some things we are looking at will include :
- Observation Space Analysis
- Action Space Analysis
- Testing the Actions for the Pendulum
OBSERVATION SPACE ANALYSIS
- Based on the results, we can see that, as stated in the background information, the observation space has a total of 3 dimensions.
- For obs_low, the values -1, -1, -8 represent the smallest possible values for each of the 3 dimensions (x-coordinate, y-coordinate and angular velocity); likewise, obs_high holds the largest values 1, 1, 8.
# Finding the minimum and maximum allowable values for each dimension of observation
obs_low = env.observation_space.low
obs_high = env.observation_space.high
print('Number of Observation Space: ', env.observation_space.shape)
print("Observation Space Low:", obs_low)
print("Observation Space High:", obs_high)
Number of Observation Space: (3,) Observation Space Low: [-1. -1. -8.] Observation Space High: [1. 1. 8.]
ACTION SPACE ANALYSIS
- From our analysis, we see that the action space is continuous, with values ranging from -2.0 to 2.0.
- As for the action shape, it is a single scalar represented as a 32-bit floating-point number.
print('Number of Actions: ', env.action_space)
Number of Actions: Box(-2.0, 2.0, (1,), float32)
TESTING ACTIONS AND ITS EFFECTS ON THE PENDULUM
Now, we will look into how each action affects the pendulum. In the case of the pendulum, there are no discrete actions (the pendulum can apply infinitely many different torques). Hence, we have selected 5 representative types of action the pendulum could make, and will look more closely at these for our EDA :
- Zero Torque
- Positive Maximum Torque (2.0)
- Negative Maximum Torque (-2.0)
- Gradual Increase in Torque
- Gradual Decrease in Torque
ACTION 1 : ZERO TORQUE
- We see that with zero torque, no external force is applied to the pendulum, so there is limited movement and the pendulum tends to settle in a downward position over time.
- Settling downward under gravity limits the reward the pendulum can gain, since the reward system encourages maintaining an upright position.
# Zero Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([0.0])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
ACTION 2 : POSITIVE TORQUE [2.0]
- We see that with the maximum positive torque applied, the highest possible force is exerted, and the pendulum swings to the right then to the left (from its downward position).
- The positive torque helps the pendulum gain more angular momentum, but reduces its stability around the upright position and hence results in lower rewards within the environment.
# Positive Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([2.0])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
ACTION 3 : NEGATIVE TORQUE [-2.0]
- We see that with the maximum negative torque applied, the highest possible force is exerted, and the pendulum swings to the left then to the right (from its downward position).
- The negative torque also helps the pendulum gain more angular momentum, but reduces its stability around the upright position and hence results in lower rewards within the environment.
# Negative Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([-2.0])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
ACTION 4 : GRADUAL INCREASE IN TORQUE
- We see that a gradual increase in torque allows for exploration of a broader stability landscape, observing how the pendulum responds to rising forces and identifying stability regions.
- The gradual increase in torque leads to a systematic rise in the force applied to the pendulum. This incrementally alters the pendulum's behavior, potentially causing wider swings or movements in the opposite direction to its natural hanging position.
# Gradual Increase in Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([-2.0 + (i * 0.08)])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
ACTION 5 : GRADUAL DECREASE IN TORQUE
- We see that a gradual decrease in torque explores stability concerning diminishing forces, potentially observing how the pendulum's movement changes as force reduces.
- The gradual decrease in torque systematically reduces the applied force. This decrement may gradually slow down the pendulum's movement or bring it closer to the natural downward position, which could aid in stability but limit exploration for more optimal strategies.
# Gradual Decrease in Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([2.0 - (i * 0.08)])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
SUMMARY ANALYSIS OF TORQUE MOVEMENTS
We found that applying torque to the pendulum triggers substantial changes in its swinging behavior. High positive torque leads to forceful swings away from the natural downward position, resulting in decreased stability and penalties within the reward system. Conversely, negative torque drives movement in the opposite direction, potentially aiding stability, yet its limited effectiveness still incurs penalties due to deviations from the desired position.
This could indicate that lower torque may provide higher rewards, as it encourages stability in the pendulum's movement.
Moreover, the reward system penalizes excessive movement, high velocities, and deviations from the desired stable state caused by high torque, resulting in reduced overall rewards. However, gradual changes in torque offer opportunities for systematic exploration, aiding in learning and potentially optimizing strategies for balancing the pendulum while minimizing the penalties incurred in the rewards system.
MODEL DEVELOPMENT & EXPLORATION¶
Upon gathering insights from our EDA, we will now be proceeding to build and test a few reinforcement learning models to help balance the pendulum by exerting an appropriate level of torque.
We will be testing with the following models :
- Random Action Model (Baseline Model)
- Simple Deep Q Network (DQN)
- Enhanced Deep Q Network (Improved Model)
- Double Deep Q Network (DDQN)
- Soft Actor-Critic Network (SAC)
In this RL analysis, we will be diving deeper into DQN-related architectures compared to other models to demonstrate its viability in solving the Pendulum task.
MODEL 1 : RANDOM ACTION MODEL - BASELINE¶
The random action model serves as a baseline: it operates by making decisions solely through random selection from the available action space, without any consideration of the environment's state or any learning strategy.
This model will serve as a fundamental benchmark for us to evaluate the performance of more advanced models later on, such as Deep Q Network.
CREATING AN AGENT THAT TAKES RANDOM ACTIONS
Since the pendulum's episode could otherwise go on indefinitely, we set a fixed limit of 200 steps per episode. We will run 800 episodes to give us a benchmark for how well our next few models should perform.
# Create the Gym environment for Pendulum with specified gravity and render mode
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
# Initialize an array to store scores for visualization
total_rewards = []
frames = []
# Define the maximum number of episodes and steps per episode
MAX_EPISODES = 800
MAX_STEP_PER_EPISODE = 200
# Loop through the episodes using a for loop
for i in range(MAX_EPISODES):
    state = env.reset()
    total_reward = 0
    done = False
    start_time = datetime.datetime.now()

    # Loop through the maximum steps per episode
    for step in range(MAX_STEP_PER_EPISODE):
        action = env.action_space.sample()  # Select a random action from the action space
        state, reward, done, info = env.step(action)  # Apply the action and observe the result
        total_reward += reward
        if step % 30 == 0 and total_reward > -50:
            screen = env.render(mode='rgb_array')
            frames.append(screen)
        if done:
            break

    elapsed_time = datetime.datetime.now() - start_time
    if i % 10 == 0:
        print('Episode {:>4} | Total Reward: {:>8.2f} | Elapsed: {}'.format(i, total_reward, elapsed_time))
    total_rewards.append(total_reward)
# Close the environment
env.close()
Episode 0 | Total Reward: -1802.22 | Elapsed: 0:00:00.411252 Episode 10 | Total Reward: -1321.27 | Elapsed: 0:00:00.015025 Episode 20 | Total Reward: -1291.68 | Elapsed: 0:00:00.015692 Episode 30 | Total Reward: -992.26 | Elapsed: 0:00:00.015252 Episode 40 | Total Reward: -1534.34 | Elapsed: 0:00:00.015068 Episode 50 | Total Reward: -1617.21 | Elapsed: 0:00:00.015047 Episode 60 | Total Reward: -1170.23 | Elapsed: 0:00:00.013512 Episode 70 | Total Reward: -1198.94 | Elapsed: 0:00:00.014853 Episode 80 | Total Reward: -1304.78 | Elapsed: 0:00:00.015785 Episode 90 | Total Reward: -903.80 | Elapsed: 0:00:00.014683 Episode 100 | Total Reward: -886.82 | Elapsed: 0:00:00.013750 Episode 110 | Total Reward: -894.23 | Elapsed: 0:00:00.013363 Episode 120 | Total Reward: -755.78 | Elapsed: 0:00:00.019042 Episode 130 | Total Reward: -917.89 | Elapsed: 0:00:00.015532 Episode 140 | Total Reward: -1167.00 | Elapsed: 0:00:00.014530 Episode 150 | Total Reward: -1189.97 | Elapsed: 0:00:00.020365 Episode 160 | Total Reward: -1182.69 | Elapsed: 0:00:00.014773 Episode 170 | Total Reward: -1019.11 | Elapsed: 0:00:00.016037 Episode 180 | Total Reward: -969.14 | Elapsed: 0:00:00.016114 Episode 190 | Total Reward: -1060.26 | Elapsed: 0:00:00.013517 Episode 200 | Total Reward: -900.67 | Elapsed: 0:00:00.018020 Episode 210 | Total Reward: -1054.46 | Elapsed: 0:00:00.015009 Episode 220 | Total Reward: -1071.76 | Elapsed: 0:00:00.016130 Episode 230 | Total Reward: -1291.16 | Elapsed: 0:00:00.016550 Episode 240 | Total Reward: -964.53 | Elapsed: 0:00:00.014381 Episode 250 | Total Reward: -1696.45 | Elapsed: 0:00:00.018044 Episode 260 | Total Reward: -1546.35 | Elapsed: 0:00:00.014513 Episode 270 | Total Reward: -967.59 | Elapsed: 0:00:00.014515 Episode 280 | Total Reward: -1330.98 | Elapsed: 0:00:00.015257 Episode 290 | Total Reward: -1276.31 | Elapsed: 0:00:00.025327 Episode 300 | Total Reward: -1448.81 | Elapsed: 0:00:00.019039 Episode 310 | Total Reward: -969.73 | Elapsed: 0:00:00.016753 
Episode 320 | Total Reward: -917.34 | Elapsed: 0:00:00.027612 Episode 330 | Total Reward: -992.74 | Elapsed: 0:00:00.019041 Episode 340 | Total Reward: -997.48 | Elapsed: 0:00:00.015070 Episode 350 | Total Reward: -1359.94 | Elapsed: 0:00:00.015710 Episode 360 | Total Reward: -1217.04 | Elapsed: 0:00:00.015013 Episode 370 | Total Reward: -1333.30 | Elapsed: 0:00:00.017028 Episode 380 | Total Reward: -972.93 | Elapsed: 0:00:00.015113 Episode 390 | Total Reward: -927.15 | Elapsed: 0:00:00.015857 Episode 400 | Total Reward: -1402.74 | Elapsed: 0:00:00.014513 Episode 410 | Total Reward: -866.96 | Elapsed: 0:00:00.016380 Episode 420 | Total Reward: -868.44 | Elapsed: 0:00:00.014042 Episode 430 | Total Reward: -892.04 | Elapsed: 0:00:00.015070 Episode 440 | Total Reward: -1345.45 | Elapsed: 0:00:00.013859 Episode 450 | Total Reward: -1051.27 | Elapsed: 0:00:00.015856 Episode 460 | Total Reward: -1476.64 | Elapsed: 0:00:00.014025 Episode 470 | Total Reward: -1347.09 | Elapsed: 0:00:00.015376 Episode 480 | Total Reward: -1427.48 | Elapsed: 0:00:00.015203 Episode 490 | Total Reward: -1189.14 | Elapsed: 0:00:00.015038 Episode 500 | Total Reward: -1500.24 | Elapsed: 0:00:00.014024 Episode 510 | Total Reward: -1488.33 | Elapsed: 0:00:00.016121 Episode 520 | Total Reward: -939.01 | Elapsed: 0:00:00.014393 Episode 530 | Total Reward: -1673.15 | Elapsed: 0:00:00.014360 Episode 540 | Total Reward: -1288.93 | Elapsed: 0:00:00.015143 Episode 550 | Total Reward: -1458.60 | Elapsed: 0:00:00.015359 Episode 560 | Total Reward: -1403.01 | Elapsed: 0:00:00.014623 Episode 570 | Total Reward: -1292.03 | Elapsed: 0:00:00.015744 Episode 580 | Total Reward: -849.16 | Elapsed: 0:00:00.015180 Episode 590 | Total Reward: -1720.54 | Elapsed: 0:00:00.015425 Episode 600 | Total Reward: -773.16 | Elapsed: 0:00:00.013536 Episode 610 | Total Reward: -766.59 | Elapsed: 0:00:00.014706 Episode 620 | Total Reward: -1544.38 | Elapsed: 0:00:00.015905 Episode 630 | Total Reward: -1449.55 | Elapsed: 
0:00:00.014895 Episode 640 | Total Reward: -1339.64 | Elapsed: 0:00:00.015521 Episode 650 | Total Reward: -829.12 | Elapsed: 0:00:00.015473 Episode 660 | Total Reward: -1444.76 | Elapsed: 0:00:00.015420 Episode 670 | Total Reward: -910.04 | Elapsed: 0:00:00.017223 Episode 680 | Total Reward: -753.93 | Elapsed: 0:00:00.014641 Episode 690 | Total Reward: -1520.10 | Elapsed: 0:00:00.015305 Episode 700 | Total Reward: -1487.44 | Elapsed: 0:00:00.015518 Episode 710 | Total Reward: -1651.39 | Elapsed: 0:00:00.014301 Episode 720 | Total Reward: -758.29 | Elapsed: 0:00:00.015094 Episode 730 | Total Reward: -1146.98 | Elapsed: 0:00:00.015406 Episode 740 | Total Reward: -1266.38 | Elapsed: 0:00:00.014977 Episode 750 | Total Reward: -1441.06 | Elapsed: 0:00:00.015604 Episode 760 | Total Reward: -882.69 | Elapsed: 0:00:00.015686 Episode 770 | Total Reward: -1009.75 | Elapsed: 0:00:00.014754 Episode 780 | Total Reward: -912.81 | Elapsed: 0:00:00.015169 Episode 790 | Total Reward: -1045.95 | Elapsed: 0:00:00.014151
VISUALIZING THE PERFORMANCE OF RANDOM AGENT MODEL
- From our baseline agent, we note that it performs purely random actions, and none of its episodes got close to a reward of 0 (or above).
- It is also clear that no learning is taking place, since no model is learning any patterns; the results are therefore erratic, with no visible improvement.
- From our results, the best episode for the baseline model is episode 61, with a score of -728.41. This score indicates that the pendulum failed to balance, so this model is not good enough to solve the Pendulum task.
# Calculating statistical measures
average_reward = np.mean(total_rewards)
median_reward = np.median(total_rewards)
max_reward = np.max(total_rewards)
min_reward = np.min(total_rewards)
# Identifying the best episode
best_episode_index = np.argmax(total_rewards)
# Neatly formatted output
print("Performance Statistics for the Random Agent:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(total_rewards, average_reward, model_name="Random Agent")
Performance Statistics for the Random Agent: -------------------------------------------- Best Episode : 61 Average Reward : -1219.34 Median Reward : -1179.79 Maximum Reward : -728.41 Minimum Reward : -1830.10
VISUALIZING THE PENDULUM ANIMATION FOR THE RANDOM ACTION MODEL
- Now, we will be looking at the pendulum's animation for the random action model and how it moves.
create_animation(frames)
MODEL 2 : SIMPLE DEEP Q NETWORK (DQN)¶
DQN (Deep Q-Network) is a reinforcement learning algorithm that combines Q-Learning with deep neural networks to estimate the Q-value function. The goal of DQN is to find a policy that maximizes the expected cumulative reward in an environment, by using the neural network to approximate the Q-value for each possible action in a given state. This allows DQN to scale to high-dimensional state spaces and solve more complex problems than traditional Q-Learning methods.
In reinforcement learning, the Q-value function represents the expected cumulative reward from taking a certain action in a certain state and following a specific policy thereafter. DQN uses a neural network to approximate the Q-value function and make decisions about which action to take in each state. The network is trained on a dataset of state-action-reward transitions generated by interacting with the environment. The training process updates the network weights so that the estimated Q-values for each action become more accurate over time.
One key innovation of DQN is the use of experience replay, a technique for storing and reusing previously observed state-action-reward transitions to decorrelate training samples and improve the stability of the learning process. Another important aspect of DQN is the use of a target network, a separate copy of the Q-network that is used to compute stable learning targets. The target network's weights are updated less frequently (or more slowly) than the primary network's weights, which prevents the training targets from shifting with every update and stabilizes the learning process.
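To make the Q-value update concrete, here is a minimal tabular Q-learning update on a toy problem; the states, rewards, and learning rate are purely illustrative and not part of the Pendulum task. DQN replaces the table below with a neural network, but the target it regresses toward has the same shape.

```python
# Tabular Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
gamma, alpha = 0.98, 0.5

# Toy Q-table: 2 states x 2 actions (arbitrary starting estimates)
Q = {0: [0.0, 0.0], 1: [1.0, 2.0]}

def q_update(Q, s, a, r, s_next):
    td_target = r + gamma * max(Q[s_next])  # bootstrap from the best next action
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q[s][a]

new_q = q_update(Q, s=0, a=1, r=-1.0, s_next=1)
print(new_q)  # 0.48 - halfway from 0.0 toward the target -1 + 0.98 * 2 = 0.96
```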
INITIALIZING AND CREATING THE REPLAYBUFFER CLASS
- Here, before defining the model architecture, we will define the ReplayBuffer class, which serves as a memory storage system in RL tasks.
- It is designed to store and manage past experiences (transitions of the agent interacting with its environment). By creating a ReplayBuffer, we enable efficient storage and retrieval of these experiences, ensuring that the agent learns from a diverse set of historical interactions.
class ReplayBuffer:
    def __init__(self, buffer_limit):
        self.buffer = deque(maxlen=buffer_limit)

    def put(self, transition):
        self.buffer.append(transition)

    def sample(self, n):
        mini_batch = random.sample(self.buffer, n)
        s_lst, a_lst, r_lst, s_prime_lst, done_mask_lst = [], [], [], [], []
        for transition in mini_batch:
            s, a, r, s_prime, done = transition
            s_lst.append(s)
            a_lst.append([a])
            r_lst.append([r])
            s_prime_lst.append(s_prime)
            done_mask = 0.0 if done else 1.0
            done_mask_lst.append([done_mask])
        s_batch = torch.tensor(s_lst, dtype=torch.float)
        a_batch = torch.tensor(a_lst, dtype=torch.float)
        r_batch = torch.tensor(r_lst, dtype=torch.float)
        s_prime_batch = torch.tensor(s_prime_lst, dtype=torch.float)
        done_batch = torch.tensor(done_mask_lst, dtype=torch.float)
        return s_batch, a_batch, r_batch, s_prime_batch, done_batch

    def size(self):
        return len(self.buffer)
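As a quick sanity check of the buffer logic, here is a simplified, torch-free sketch of the same put/sample cycle; the transition values are made up for illustration.

```python
import random
from collections import deque

# Simplified replay buffer mirroring the class above, minus the tensor conversion
buffer = deque(maxlen=5)

# Fill with dummy (state, action, reward, next_state, done) transitions
for i in range(8):
    buffer.append((i, 0, -1.0, i + 1, False))

# maxlen=5 means the 3 oldest transitions were evicted automatically
print(len(buffer))   # 5
print(buffer[0][0])  # 3 (oldest surviving state)

# Random mini-batch sampling decorrelates consecutive transitions
batch = random.sample(buffer, 3)
print(len(batch))    # 3
```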
SETTING UP THE MODEL ARCHITECTURE FOR THE SIMPLE DQN MODEL
- As DQN is designed for discrete action spaces, we will discretize the continuous action space of the Pendulum task for this variation of DQN.
- Later on, we will also explore other variations of DQN to see how our adjustments affect the model's performance.
- For this model architecture, we will train it for 800 episodes and evaluate how the reward changes.
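The discretization used below maps the 9 discrete action indices onto evenly spaced torques across the Pendulum's continuous action range of [-2, 2]. A quick sketch of that mapping:

```python
# Map discrete action indices 0..8 to torques in Pendulum's action range [-2, 2]
ACTION_DIM = 9
torques = [(a - (ACTION_DIM - 1) / 2) / ((ACTION_DIM - 1) / 4) for a in range(ACTION_DIM)]
print(torques)  # [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
```

With 9 actions this reduces to (a - 4) / 2, which is the mapping the agent code uses.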
# Defining the QNetwork class for the DQN Agent
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, q_lr):
        super(QNetwork, self).__init__()
        self.fc_1 = nn.Linear(state_dim, 64)
        self.fc_2 = nn.Linear(64, 32)
        self.fc_out = nn.Linear(32, action_dim)
        self.lr = q_lr
        self.optimizer = optim.Adam(self.parameters(), lr=self.lr)

    def forward(self, x):
        q = F.leaky_relu(self.fc_1(x))
        q = F.leaky_relu(self.fc_2(q))
        q = self.fc_out(q)
        return q
# Creating a class for the DQN Agent
class DQNAgent:
    def __init__(self):
        self.state_dim = 3
        self.action_dim = 9
        self.lr = 0.01
        self.gamma = 0.98
        self.tau = 0.01
        self.epsilon = 1.0
        self.epsilon_decay = 0.98
        self.epsilon_min = 0.001
        self.buffer_size = 100000
        self.batch_size = 200
        self.memory = ReplayBuffer(self.buffer_size)
        self.Q = QNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target = QNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target.load_state_dict(self.Q.state_dict())

    def choose_action(self, state):
        # Epsilon-greedy: exploit the Q-network with probability (1 - epsilon)
        random_number = np.random.rand()
        maxQ_action_count = 0
        if self.epsilon < random_number:
            with torch.no_grad():
                action = float(torch.argmax(self.Q(state)).numpy())
            maxQ_action_count = 1
        else:
            action = np.random.choice(self.action_dim)
        # Map the discrete action index 0..8 to a torque in [-2, 2]
        # (the same mapping for both branches; the test agent below uses it too)
        real_action = (action - 4) / 2
        return action, real_action, maxQ_action_count

    def calc_target(self, mini_batch):
        s, a, r, s_prime, done = mini_batch
        with torch.no_grad():
            # TD target: r + gamma * done_mask * max_a' Q_target(s', a')
            q_target = self.Q_target(s_prime).max(1)[0].unsqueeze(1)
            target = r + self.gamma * done * q_target
        return target

    def train_agent(self):
        mini_batch = self.memory.sample(self.batch_size)
        s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
        a_batch = a_batch.type(torch.int64)
        td_target = self.calc_target(mini_batch)
        # QNetwork training
        Q_a = self.Q(s_batch).gather(1, a_batch)
        q_loss = F.smooth_l1_loss(Q_a, td_target)
        self.Q.optimizer.zero_grad()
        q_loss.mean().backward()
        self.Q.optimizer.step()
        # QNetwork soft update: target weights drift slowly toward the online network
        for param_target, param in zip(self.Q_target.parameters(), self.Q.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
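To illustrate what calc_target and the soft update compute, here is a hand-worked sketch on plain numbers; the reward, next-state Q-value, and parameter values are made up for illustration.

```python
gamma, tau = 0.98, 0.01

# TD target: r + gamma * done_mask * max_a' Q_target(s', a')
# done_mask is 1.0 while the episode continues, 0.0 at termination
r, done_mask, max_q_next = -1.0, 1.0, -50.0
td_target = r + gamma * done_mask * max_q_next
print(td_target)  # -50.0

# Soft update: each target parameter moves a fraction tau toward the online network
param, param_target = 2.0, 0.0
param_target = param_target * (1.0 - tau) + param * tau
print(param_target)  # 0.02
```

With tau = 0.01 the target network absorbs only 1% of the online weights per update, which is why the learning targets change slowly and training stays stable.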
def train_DQNAgent():
    # Initialize the DQN Agent and related variables required
    agent = DQNAgent()
    env = gym.make('Pendulum-v1', g=9.81)
    episodes = 800
    total_rewards = []
    frames = []
    no_of_steps = []
    success_count = 0
    best_episode = 0
    best_reward = float('-inf')
    # Loop through the range of episodes
    for episode in range(episodes):
        state = env.reset()
        score, done = 0.0, False
        maxQ_action_count = 0
        start_time = datetime.datetime.now()
        while not done:
            action, real_action, count = agent.choose_action(torch.FloatTensor(state))
            state_prime, reward, done, _ = env.step([real_action])
            agent.memory.put((state, action, reward, state_prime, done))
            score += reward
            maxQ_action_count += count
            state = state_prime
            if maxQ_action_count % 100 == 0 and score > -50:
                screen = env.render(mode='rgb_array')
                frames.append(screen)
            if agent.memory.size() > 1000:
                agent.train_agent()
        # Recording results
        if len(total_rewards) > 0:
            success_count += (score - total_rewards[-1]) >= 200
        total_rewards.append(score)
        no_of_steps.append(maxQ_action_count)
        if score > best_reward:
            best_reward = score
            best_episode = episode
        # Saving the Models
        save_folder = "DQN"
        if not os.path.exists(save_folder):
            os.makedirs(save_folder)
        if episode == best_episode:
            model_Q = os.path.join(save_folder, "DQN" + str(episode) + ".pt")
            torch.save(agent.Q.state_dict(), model_Q)
        if episode % 10 == 0:
            elapsed_time = datetime.datetime.now() - start_time
            print('Episode {:>4} | Total Reward: {:>8.2f} | MaxQ_Action_Count:{:>5} | Epsilon: {:>4.4f} | Elapsed: {}'.format(episode, score, maxQ_action_count, agent.epsilon, elapsed_time))
        if agent.epsilon > agent.epsilon_min:
            agent.epsilon *= agent.epsilon_decay
    env.close()
    return {
        'total_rewards': total_rewards,
        'no_of_steps': no_of_steps,
        'success_count': success_count,
        'frames': frames
    }
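The epsilon schedule in the training loop above is a simple geometric decay applied once per episode. Its value at episode n can be computed in closed form as a close approximation of the loop's behaviour (the loop floors at epsilon_min slightly differently); note the value at episode 10 matches the printed log below (~0.8171):

```python
eps0, decay, eps_min = 1.0, 0.98, 0.001

def epsilon_at(episode):
    # Epsilon after `episode` multiplicative decays, floored at eps_min
    return max(eps_min, eps0 * decay ** episode)

print(round(epsilon_at(10), 4))    # 0.8171
print(epsilon_at(800) == eps_min)  # True - decay bottoms out long before episode 800
```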
DQN_results = train_DQNAgent()
Episode 0 | Total Reward: -1442.58 | MaxQ_Action_Count: 0 | Epsilon: 1.0000 | Elapsed: 0:00:00.581677
Episode 10 | Total Reward: -875.69 | MaxQ_Action_Count: 34 | Epsilon: 0.8171 | Elapsed: 0:00:00.584059
Episode 20 | Total Reward: -894.80 | MaxQ_Action_Count: 66 | Epsilon: 0.6676 | Elapsed: 0:00:00.685368
Episode 30 | Total Reward: -889.94 | MaxQ_Action_Count: 91 | Epsilon: 0.5455 | Elapsed: 0:00:00.568242
Episode 40 | Total Reward: -379.12 | MaxQ_Action_Count: 124 | Epsilon: 0.4457 | Elapsed: 0:00:00.567993
Episode 50 | Total Reward: -490.62 | MaxQ_Action_Count: 132 | Epsilon: 0.3642 | Elapsed: 0:00:00.650391
Episode 60 | Total Reward: -376.69 | MaxQ_Action_Count: 143 | Epsilon: 0.2976 | Elapsed: 0:00:00.580481
Episode 70 | Total Reward: -373.07 | MaxQ_Action_Count: 154 | Epsilon: 0.2431 | Elapsed: 0:00:00.584176
Episode 80 | Total Reward: -124.28 | MaxQ_Action_Count: 162 | Epsilon: 0.1986 | Elapsed: 0:00:00.567040
Episode 90 | Total Reward: -892.51 | MaxQ_Action_Count: 163 | Epsilon: 0.1623 | Elapsed: 0:00:00.585132
Episode 100 | Total Reward: -365.75 | MaxQ_Action_Count: 172 | Epsilon: 0.1326 | Elapsed: 0:00:00.581452
Episode 110 | Total Reward: -124.99 | MaxQ_Action_Count: 186 | Epsilon: 0.1084 | Elapsed: 0:00:00.577302
Episode 120 | Total Reward: -251.45 | MaxQ_Action_Count: 189 | Epsilon: 0.0885 | Elapsed: 0:00:00.607042
Episode 130 | Total Reward: -615.79 | MaxQ_Action_Count: 186 | Epsilon: 0.0723 | Elapsed: 0:00:00.617960
Episode 140 | Total Reward: -252.02 | MaxQ_Action_Count: 190 | Epsilon: 0.0591 | Elapsed: 0:00:00.432549
Episode 150 | Total Reward: -245.99 | MaxQ_Action_Count: 192 | Epsilon: 0.0483 | Elapsed: 0:00:00.503967
Episode 160 | Total Reward: -124.51 | MaxQ_Action_Count: 191 | Epsilon: 0.0395 | Elapsed: 0:00:00.372844
Episode 170 | Total Reward: -122.16 | MaxQ_Action_Count: 193 | Epsilon: 0.0322 | Elapsed: 0:00:00.554304
Episode 180 | Total Reward: -238.61 | MaxQ_Action_Count: 196 | Epsilon: 0.0263 | Elapsed: 0:00:00.478303
Episode 190 | Total Reward: -492.15 | MaxQ_Action_Count: 197 | Epsilon: 0.0215 | Elapsed: 0:00:00.633898
Episode 200 | Total Reward: -124.90 | MaxQ_Action_Count: 199 | Epsilon: 0.0176 | Elapsed: 0:00:00.591285
Episode 210 | Total Reward: -244.92 | MaxQ_Action_Count: 197 | Epsilon: 0.0144 | Elapsed: 0:00:00.589859
Episode 220 | Total Reward: -1.64 | MaxQ_Action_Count: 200 | Epsilon: 0.0117 | Elapsed: 0:00:00.633756
Episode 230 | Total Reward: -357.21 | MaxQ_Action_Count: 198 | Epsilon: 0.0096 | Elapsed: 0:00:00.639592
Episode 240 | Total Reward: -1.74 | MaxQ_Action_Count: 200 | Epsilon: 0.0078 | Elapsed: 0:00:00.656186
Episode 250 | Total Reward: -245.18 | MaxQ_Action_Count: 199 | Epsilon: 0.0064 | Elapsed: 0:00:00.657843
Episode 260 | Total Reward: -236.65 | MaxQ_Action_Count: 199 | Epsilon: 0.0052 | Elapsed: 0:00:00.621576
Episode 270 | Total Reward: -367.15 | MaxQ_Action_Count: 200 | Epsilon: 0.0043 | Elapsed: 0:00:00.773133
Episode 280 | Total Reward: -237.50 | MaxQ_Action_Count: 198 | Epsilon: 0.0035 | Elapsed: 0:00:00.638295
Episode 290 | Total Reward: -2.32 | MaxQ_Action_Count: 200 | Epsilon: 0.0029 | Elapsed: 0:00:00.645979
Episode 300 | Total Reward: -729.79 | MaxQ_Action_Count: 200 | Epsilon: 0.0023 | Elapsed: 0:00:00.617639
Episode 310 | Total Reward: -754.04 | MaxQ_Action_Count: 199 | Epsilon: 0.0019 | Elapsed: 0:00:00.617213
Episode 320 | Total Reward: -608.00 | MaxQ_Action_Count: 200 | Epsilon: 0.0016 | Elapsed: 0:00:00.587593
Episode 330 | Total Reward: -127.17 | MaxQ_Action_Count: 200 | Epsilon: 0.0013 | Elapsed: 0:00:00.607517
Episode 340 | Total Reward: -238.76 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.683272
Episode 350 | Total Reward: -1.74 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.693084
Episode 360 | Total Reward: -2.97 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.675632
Episode 370 | Total Reward: -247.95 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.658638
Episode 380 | Total Reward: -121.05 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.615279
Episode 390 | Total Reward: -369.79 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.940296
Episode 400 | Total Reward: -674.26 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.812862
Episode 410 | Total Reward: -122.86 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.834430
Episode 420 | Total Reward: -126.15 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.763558
Episode 430 | Total Reward: -125.37 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.719899
Episode 440 | Total Reward: -123.28 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.722948
Episode 450 | Total Reward: -366.78 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.698313
Episode 460 | Total Reward: -2.89 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.606581
Episode 470 | Total Reward: -2.53 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.628375
Episode 480 | Total Reward: -245.80 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.615340
Episode 490 | Total Reward: -126.18 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.612532
Episode 500 | Total Reward: -126.28 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.616444
Episode 510 | Total Reward: -127.19 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.590390
Episode 520 | Total Reward: -125.42 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.782460
Episode 530 | Total Reward: -374.73 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.579272
Episode 540 | Total Reward: -484.54 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.592744
Episode 550 | Total Reward: -125.92 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.721086
Episode 560 | Total Reward: -124.02 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.684132
Episode 570 | Total Reward: -354.03 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.993130
Episode 580 | Total Reward: -366.26 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:01.313442
Episode 590 | Total Reward: -122.87 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.807287
Episode 600 | Total Reward: -123.00 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.886123
Episode 610 | Total Reward: -128.40 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.903556
Episode 620 | Total Reward: -129.08 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.935696
Episode 630 | Total Reward: -485.23 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.846403
Episode 640 | Total Reward: -127.66 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.915433
Episode 650 | Total Reward: -629.76 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.798961
Episode 660 | Total Reward: -362.97 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.805026
Episode 670 | Total Reward: -369.98 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.960277
Episode 680 | Total Reward: -3.53 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.938893
Episode 690 | Total Reward: -364.13 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.797671
Episode 700 | Total Reward: -126.48 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.818296
Episode 710 | Total Reward: -734.94 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.822243
Episode 720 | Total Reward: -371.50 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.789885
Episode 730 | Total Reward: -486.12 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.822754
Episode 740 | Total Reward: -485.64 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.772636
Episode 750 | Total Reward: -619.11 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.801603
Episode 760 | Total Reward: -380.69 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.843347
Episode 770 | Total Reward: -366.47 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.802733
Episode 780 | Total Reward: -255.12 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.889237
Episode 790 | Total Reward: -579.27 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.818417
VISUALIZING THE PERFORMANCE OF SIMPLE DQN MODEL
- From our Simple DQN Model, we can see that the agent performed significantly better than the Random Action baseline model, with a clear improvement in the rewards obtained. This indicates that the agent is learning effectively from its experiences and is gradually refining its policy to better address the task at hand: balancing the pendulum upright.
HOW WILL WE IMPROVE THIS MODEL'S PERFORMANCE?
- In the next section, we will attempt to refine the action space by adjusting the number of discrete actions. We will also look at adjusting the learning rate and the epsilon value, since our goal is to maintain the pendulum's upright state over time.
- Aside from the above, we will manually tweak the neural network's architecture to see if adjusting the layer sizes or adding more layers can lead to improved performance.
# Calculating statistical measures
average_reward = np.mean(DQN_results['total_rewards'])
median_reward = np.median(DQN_results['total_rewards'])
max_reward = np.max(DQN_results['total_rewards'])
min_reward = np.min(DQN_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(DQN_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the Simple DQN Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(DQN_results['total_rewards'], average_reward, model_name="Simple DQN")
Performance Statistics for the Simple DQN Model:
--------------------------------------------
Best Episode : 138
Average Reward : -347.25
Median Reward : -252.25
Maximum Reward : -1.52
Minimum Reward : -1775.87
VIEWING THE MODEL ARCHITECTURE AND PENDULUM ANIMATION
- Now, we will look at the model architecture used to train the DQN agent, using PyTorch's .eval() function.
- We will also view the animation of the pendulum's movement and visualize how the pendulum behaves.
# Load and view the model's architecture used for DQN
trained_model = DQNAgent()
trained_model.Q.load_state_dict(torch.load("DQN/DQN138.pt"))
trained_model.Q.eval()
QNetwork(
  (fc_1): Linear(in_features=3, out_features=64, bias=True)
  (fc_2): Linear(in_features=64, out_features=32, bias=True)
  (fc_out): Linear(in_features=32, out_features=9, bias=True)
)
TESTING OUR MODEL WEIGHTS
- There is no training involved in this step.
- It simply checks whether the saved model weights can keep the pendulum inverted.
class DQNTestAgent:
    def __init__(self, weight_file_path):
        self.state_dim = 3
        self.action_dim = 9
        self.lr = 0.01
        self.trained_model = weight_file_path
        self.Q = QNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q.load_state_dict(torch.load(self.trained_model))

    def choose_action(self, state):
        # Greedy action only: no exploration at test time
        with torch.no_grad():
            action = float(torch.argmax(self.Q(state)).numpy())
        real_action = (action - 4) / 2
        return real_action
agent = DQNTestAgent('DQN/DQN138.pt')
test_agent(agent, 'Simple DQN')
Test reward: -127.04517067648156
MODEL TRAINING EVOLUTION
- Visualize how the model has improved over each episode
# Visualizing the pendulum's animation
create_animation(DQN_results['frames'])
MODEL 3 : ENHANCED DQN MODEL¶
- Now that we have experimented with a Simple DQN model, we can see that it already performs relatively well at balancing the pendulum upright.
- Hence, for the enhanced DQN model, we will introduce improvements aimed at achieving a better maximum score on the pendulum task and a higher average reward. Essentially, we want to reduce the penalty.
We will mainly be exploring the following changes:
- Adding one more layer to the neural network (deepening the QNetwork model).
- Increasing the action_dim from 9 to 15 (increasing the number of discretized actions the pendulum can perform).
- Reducing the learning rate from 0.01 to 0.001 and increasing the initial epsilon from 1.0 to 1.5.
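A 15-way discretization that spans the full torque range [-2, 2] can be sketched as follows; this generalized index-to-torque formula is our illustration of how the 9-action mapping extends to any action_dim.

```python
# General mapping: spread `action_dim` discrete indices evenly across torques in [-2, 2]
def index_to_torque(a, action_dim):
    mid = (action_dim - 1) / 2
    return (a - mid) / ((action_dim - 1) / 4)

torques_15 = [index_to_torque(a, 15) for a in range(15)]
print(torques_15[0], torques_15[7], torques_15[-1])  # -2.0 0.0 2.0
```

With 15 actions the torque step shrinks from 0.5 to about 0.29, giving the agent finer control near the balance point.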
# Defining the ImprovedQNetwork class for the Enhanced DQN Agent
class ImprovedQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, q_lr):
        super(ImprovedQNetwork, self).__init__()
        self.fc_1 = nn.Linear(state_dim, 64)
        self.fc_2 = nn.Linear(64, 32)
        self.fc_3 = nn.Linear(32, 16)  # Added another layer to the network
        self.fc_out = nn.Linear(16, action_dim)
        self.lr = q_lr
        self.optimizer = optim.Adam(self.parameters(), lr=self.lr)

    def forward(self, x):
        q = F.leaky_relu(self.fc_1(x))
        q = F.leaky_relu(self.fc_2(q))
        q = F.leaky_relu(self.fc_3(q))
        q = self.fc_out(q)
        return q
# Creating a class for the Improved DQN Agent
class ImprovedDQNAgent:
    def __init__(self):
        self.state_dim = 3
        self.action_dim = 15  # Increased discretization of the action space
        self.lr = 0.001       # Reduced learning rate (from 0.01)
        self.gamma = 0.98
        self.tau = 0.01
        self.epsilon = 1.5    # Increased initial epsilon by 0.5
        self.epsilon_decay = 0.98
        self.epsilon_min = 0.001
        self.buffer_size = 100000
        self.batch_size = 200
        self.memory = ReplayBuffer(self.buffer_size)
        self.Q = ImprovedQNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target = ImprovedQNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target.load_state_dict(self.Q.state_dict())

    def choose_action(self, state):
        random_number = np.random.rand()
        maxQ_action_count = 0
        if self.epsilon < random_number:
            with torch.no_grad():
                action = float(torch.argmax(self.Q(state)).numpy())
            maxQ_action_count = 1
        else:
            action = np.random.choice(self.action_dim)  # sample across all 15 actions
        # Map the index 0..14 evenly onto torques in [-2, 2]
        # (generalizes the 9-action mapping so every discretized action is reachable)
        real_action = (action - (self.action_dim - 1) / 2) / ((self.action_dim - 1) / 4)
        return action, real_action, maxQ_action_count

    def calc_target(self, mini_batch):
        s, a, r, s_prime, done = mini_batch
        with torch.no_grad():
            q_target = self.Q_target(s_prime).max(1)[0].unsqueeze(1)
            target = r + self.gamma * done * q_target
        return target

    def train_agent(self):
        mini_batch = self.memory.sample(self.batch_size)
        s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
        a_batch = a_batch.type(torch.int64)
        td_target = self.calc_target(mini_batch)
        # QNetwork training
        Q_a = self.Q(s_batch).gather(1, a_batch)
        q_loss = F.smooth_l1_loss(Q_a, td_target)
        self.Q.optimizer.zero_grad()
        q_loss.mean().backward()
        self.Q.optimizer.step()
        # QNetwork soft update
        for param_target, param in zip(self.Q_target.parameters(), self.Q.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
def train_ImprovedDQNAgent():
    # Initialize the Improved DQN Agent and related variables required
    agent = ImprovedDQNAgent()
    env = gym.make('Pendulum-v1', g=9.81)
    episodes = 800
    total_rewards = []
    frames = []
    no_of_steps = []
    success_count = 0
    best_episode = 0
    best_reward = float('-inf')
    # Loop through the range of episodes
    for episode in range(episodes):
        state = env.reset()
        score, done = 0.0, False
        maxQ_action_count = 0
        start_time = datetime.datetime.now()
        while not done:
            action, real_action, count = agent.choose_action(torch.FloatTensor(state))
            state_prime, reward, done, _ = env.step([real_action])
            agent.memory.put((state, action, reward, state_prime, done))
            score += reward
            maxQ_action_count += count
            state = state_prime
            if maxQ_action_count % 100 == 0 and score > -50:
                screen = env.render(mode='rgb_array')
                frames.append(screen)
            if agent.memory.size() > 1000:
                agent.train_agent()
        # Recording results
        if len(total_rewards) > 0:
            success_count += (score - total_rewards[-1]) >= 200
        total_rewards.append(score)
        no_of_steps.append(maxQ_action_count)
        if score > best_reward:
            best_reward = score
            best_episode = episode
        # Saving the Models
        save_folder = "IMPROVED DQN"
        if not os.path.exists(save_folder):
            os.makedirs(save_folder)
        if episode == best_episode:
            model_name = os.path.join(save_folder, "IMPROVED_DQN" + str(episode) + ".pt")
            torch.save(agent.Q.state_dict(), model_name)
        if episode % 10 == 0:
            elapsed_time = datetime.datetime.now() - start_time
            print('Episode {:>4} | Total Reward: {:>8.2f} | MaxQ_Action_Count:{:>5} | Epsilon: {:>4.4f} | Elapsed: {}'.format(episode, score, maxQ_action_count, agent.epsilon, elapsed_time))
        if agent.epsilon > agent.epsilon_min:
            agent.epsilon *= agent.epsilon_decay
    env.close()
    return {
        'total_rewards': total_rewards,
        'no_of_steps': no_of_steps,
        'success_count': success_count,
        'frames': frames
    }
ImprovedDQN_results = train_ImprovedDQNAgent()
Episode 0 | Total Reward: -1701.66 | MaxQ_Action_Count: 0 | Epsilon: 1.5000 | Elapsed: 0:00:00.465424
Episode 10 | Total Reward: -1765.65 | MaxQ_Action_Count: 0 | Epsilon: 1.2256 | Elapsed: 0:00:00.709585
Episode 20 | Total Reward: -1346.81 | MaxQ_Action_Count: 0 | Epsilon: 1.0014 | Elapsed: 0:00:00.650730
Episode 30 | Total Reward: -1693.30 | MaxQ_Action_Count: 31 | Epsilon: 0.8182 | Elapsed: 0:00:00.660949
Episode 40 | Total Reward: -647.31 | MaxQ_Action_Count: 64 | Epsilon: 0.6686 | Elapsed: 0:00:00.625166
Episode 50 | Total Reward: -760.08 | MaxQ_Action_Count: 79 | Epsilon: 0.5463 | Elapsed: 0:00:00.714982
Episode 60 | Total Reward: -908.27 | MaxQ_Action_Count: 119 | Epsilon: 0.4463 | Elapsed: 0:00:00.623377
Episode 70 | Total Reward: -253.43 | MaxQ_Action_Count: 136 | Epsilon: 0.3647 | Elapsed: 0:00:00.690183
Episode 80 | Total Reward: -842.77 | MaxQ_Action_Count: 124 | Epsilon: 0.2980 | Elapsed: 0:00:00.662831
Episode 90 | Total Reward: -362.60 | MaxQ_Action_Count: 148 | Epsilon: 0.2435 | Elapsed: 0:00:00.648018
Episode 100 | Total Reward: -242.11 | MaxQ_Action_Count: 162 | Epsilon: 0.1989 | Elapsed: 0:00:00.686762
Episode 110 | Total Reward: -246.22 | MaxQ_Action_Count: 172 | Epsilon: 0.1625 | Elapsed: 0:00:00.706575
Episode 120 | Total Reward: -253.04 | MaxQ_Action_Count: 179 | Epsilon: 0.1328 | Elapsed: 0:00:00.731047
Episode 130 | Total Reward: -125.22 | MaxQ_Action_Count: 176 | Epsilon: 0.1085 | Elapsed: 0:00:00.655017
Episode 140 | Total Reward: -248.83 | MaxQ_Action_Count: 172 | Epsilon: 0.0887 | Elapsed: 0:00:00.625905
Episode 150 | Total Reward: -355.15 | MaxQ_Action_Count: 190 | Epsilon: 0.0724 | Elapsed: 0:00:00.651852
Episode 160 | Total Reward: -122.48 | MaxQ_Action_Count: 188 | Epsilon: 0.0592 | Elapsed: 0:00:00.685368
Episode 170 | Total Reward: -124.56 | MaxQ_Action_Count: 190 | Epsilon: 0.0484 | Elapsed: 0:00:00.661211
Episode 180 | Total Reward: -120.08 | MaxQ_Action_Count: 196 | Epsilon: 0.0395 | Elapsed: 0:00:00.744368
Episode 190 | Total Reward: -445.75 | MaxQ_Action_Count: 194 | Epsilon: 0.0323 | Elapsed: 0:00:00.734945
Episode 200 | Total Reward: -235.56 | MaxQ_Action_Count: 187 | Epsilon: 0.0264 | Elapsed: 0:00:00.720266
Episode 210 | Total Reward: -120.48 | MaxQ_Action_Count: 197 | Epsilon: 0.0216 | Elapsed: 0:00:00.678907
Episode 220 | Total Reward: -0.84 | MaxQ_Action_Count: 196 | Epsilon: 0.0176 | Elapsed: 0:00:00.691770
Episode 230 | Total Reward: -231.33 | MaxQ_Action_Count: 196 | Epsilon: 0.0144 | Elapsed: 0:00:00.729621
Episode 240 | Total Reward: -124.47 | MaxQ_Action_Count: 199 | Epsilon: 0.0118 | Elapsed: 0:00:00.720605
Episode 250 | Total Reward: -368.18 | MaxQ_Action_Count: 199 | Epsilon: 0.0096 | Elapsed: 0:00:00.706631
Episode 260 | Total Reward: -252.77 | MaxQ_Action_Count: 198 | Epsilon: 0.0079 | Elapsed: 0:00:00.736654
Episode 270 | Total Reward: -122.60 | MaxQ_Action_Count: 199 | Epsilon: 0.0064 | Elapsed: 0:00:00.672207
Episode 280 | Total Reward: -239.90 | MaxQ_Action_Count: 200 | Epsilon: 0.0052 | Elapsed: 0:00:00.714770
Episode 290 | Total Reward: -126.78 | MaxQ_Action_Count: 200 | Epsilon: 0.0043 | Elapsed: 0:00:00.733234
Episode 300 | Total Reward: -123.97 | MaxQ_Action_Count: 200 | Epsilon: 0.0035 | Elapsed: 0:00:00.712044
Episode 310 | Total Reward: -246.68 | MaxQ_Action_Count: 198 | Epsilon: 0.0029 | Elapsed: 0:00:00.663007
Episode 320 | Total Reward: -124.61 | MaxQ_Action_Count: 200 | Epsilon: 0.0023 | Elapsed: 0:00:00.702435
Episode 330 | Total Reward: -119.87 | MaxQ_Action_Count: 200 | Epsilon: 0.0019 | Elapsed: 0:00:00.719391
Episode 340 | Total Reward: -364.39 | MaxQ_Action_Count: 198 | Epsilon: 0.0016 | Elapsed: 0:00:00.763188
Episode 350 | Total Reward: -390.23 | MaxQ_Action_Count: 200 | Epsilon: 0.0013 | Elapsed: 0:00:00.752954
Episode 360 | Total Reward: -243.66 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.721182
Episode 370 | Total Reward: -126.84 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.774583
Episode 380 | Total Reward: -126.41 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.778766
Episode 390 | Total Reward: -127.95 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.685989
Episode 400 | Total Reward: -125.55 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.897287
Episode 410 | Total Reward: -122.37 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.618388
Episode 420 | Total Reward: -117.74 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.788237
Episode 430 | Total Reward: -124.20 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.736342
Episode 440 | Total Reward: -125.99 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.765104
Episode 450 | Total Reward: -125.99 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.812960
Episode 460 | Total Reward: -355.16 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.737551
Episode 470 | Total Reward: -128.87 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.736732
Episode 480 | Total Reward: -120.41 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:01.098357
Episode 490 | Total Reward: -122.67 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.667135
Episode 500 | Total Reward: -415.92 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.663800
Episode 510 | Total Reward: -366.80 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.658987
Episode 520 | Total Reward: -124.80 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.668457
Episode 530 | Total Reward: -233.67 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.628503
Episode 540 | Total Reward: -237.89 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.694867
Episode 550 | Total Reward: -333.84 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.673737
Episode 560 | Total Reward: -3.03 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.660887
Episode 570 | Total Reward: -274.22 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.638734
Episode 580 | Total Reward: -360.31 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.635958
Episode 590 | Total Reward: -240.42 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.629189
Episode 600 | Total Reward: -123.62 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.666066
Episode 610 | Total Reward: -127.02 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.653596
Episode 620 | Total Reward: -230.49 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.661901
Episode 630 | Total Reward: -126.38 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.614376
Episode 640 | Total Reward: -484.51 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.626510
Episode 650 | Total Reward: -122.01 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.679646
Episode 660 | Total Reward: -124.51 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.576189
Episode 670 | Total Reward: -357.23 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.655225
Episode 680 | Total Reward: -127.42 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.719835
Episode 690 | Total Reward: -126.34 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.691893
Episode 700 | Total Reward: -130.80 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.647632
Episode 710 | Total Reward: -235.25 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.664781
Episode 720 | Total Reward: -2.75 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.676082
Episode 730 | Total Reward: -234.89 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.658749
Episode 740 | Total Reward: -258.37 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.674140
Episode 750 | Total Reward: -2.35 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.619855
Episode 760 | Total Reward: -2.99 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.648557
Episode 770 | Total Reward: -312.24 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.704718
Episode 780 | Total Reward: -124.80 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.611422
Episode 790 | Total Reward: -123.40 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.784165
VISUALIZING THE PERFORMANCE OF THE ENHANCED DQN MODEL
- Our Enhanced DQN Model yielded significant improvements compared to the Simple DQN model. The average reward rose from -347.25 to -265.13, and the median reward improved from -252.25 to -130.56, demonstrating a substantial performance boost. Additionally, the maximum reward achieved by this model reached -0.60, which is much closer to the optimal reward of 0.
- These results affirm the effectiveness of the changes made in this model, particularly the increase in the number of discretized actions available to the pendulum.
- Consequently, the successful completion of the Pendulum task is evident through the high rewards obtained after training the model. Moving forward, we will explore a variation of the DQN algorithm, DDQN, to assess its potential for achieving even better results in completing the task.
# Calculating statistical measures
average_reward = np.mean(ImprovedDQN_results['total_rewards'])
median_reward = np.median(ImprovedDQN_results['total_rewards'])
max_reward = np.max(ImprovedDQN_results['total_rewards'])
min_reward = np.min(ImprovedDQN_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(ImprovedDQN_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the Improved DQN Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(ImprovedDQN_results['total_rewards'], average_reward, model_name="Improved DQN")
Performance Statistics for the Improved DQN Model:
--------------------------------------------
Best Episode : 172
Average Reward : -265.13
Median Reward : -130.56
Maximum Reward : -0.60
Minimum Reward : -1796.05
VIEWING THE MODEL ARCHITECTURE AND PENDULUM ANIMATION
- Now, we will look at the model architecture used to train the improved DQN agent, using the .eval() function from PyTorch.
- We will also view the animation of the pendulum's movement to visualize how the pendulum behaves.
# Load and view the model's architecture used for DQN
trained_model = ImprovedDQNAgent()
trained_model.Q.load_state_dict(torch.load("IMPROVED DQN/IMPROVED_DQN172.pt"))
trained_model.Q.eval()
ImprovedQNetwork(
  (fc_1): Linear(in_features=3, out_features=64, bias=True)
  (fc_2): Linear(in_features=64, out_features=32, bias=True)
  (fc_3): Linear(in_features=32, out_features=16, bias=True)
  (fc_out): Linear(in_features=16, out_features=15, bias=True)
)
TESTING OUR MODEL WEIGHTS
- There is no training involved.
- The aim is to see whether the saved model weights can keep the pendulum inverted.
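Because the network outputs a discrete action index while Pendulum-v1 expects a continuous torque in [-2, 2], the agents map indices to torques with an affine formula such as (action - 4) / 2. As an illustrative aside (this helper is not defined anywhere in the notebook), the general form of that mapping for n evenly spaced bins is:

```python
# Illustrative helper (not part of the notebook's code): map a discrete action
# index onto Pendulum-v1's continuous torque range [-2, 2]. With n_actions=9
# this reproduces the (index - 4) / 2 mapping the agents' choose_action
# methods apply.
def index_to_torque(index, n_actions, low=-2.0, high=2.0):
    # Evenly space n_actions torque values between low and high
    return low + index * (high - low) / (n_actions - 1)
```

For example, with 9 bins, index 4 maps to the zero-torque action and indices 0 and 8 map to the extremes -2 and 2.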
# Creating a class for the DQN Agent
class ImprovedDQNTestAgent:
def __init__(self, weight_file_path):
self.state_dim = 3
self.action_dim = 15 # Increased discretization of the action space
self.lr = 0.001 # Modified learning rate value by reducing it
self.trained_model = weight_file_path
self.Q = ImprovedQNetwork(self.state_dim, self.action_dim, self.lr)
self.Q.load_state_dict(torch.load(self.trained_model))
def choose_action(self, state):
with torch.no_grad():
action = float(torch.argmax(self.Q(state)).numpy())
real_action = (action - 4) / 2
return real_action
agent = ImprovedDQNTestAgent('IMPROVED DQN/IMPROVED_DQN172.pt')
test_agent(agent, 'Improved DQN')
Test reward: -120.75985653386435
MODEL TRAINING EVOLUTION
- Visualize how the model has improved over each episode
# Visualizing the pendulum's animation
create_animation(ImprovedDQN_results['frames'])
MODEL 4 : DOUBLE DEEP-Q NETWORK (DDQN)¶
The Double Deep-Q Network (DDQN) is an advanced reinforcement learning model that builds upon the architecture of the Deep-Q Network (DQN). It addresses a critical shortcoming in the DQN, namely the overestimation of action values due to the same network being used for both selecting and evaluating an action.
- Two Neural Networks: DDQN utilizes two distinct neural networks with identical architectures. The first network, called the evaluation network, is used for selecting the best action given the current state. The second network, known as the target network, is used for evaluating the action's value.
- Delayed Target Network Updates: The target network's weights are periodically updated with the weights of the evaluation network. This delayed update, as opposed to updating after every learning step as in DQN, helps in stabilizing the learning process.
- Action Selection and Evaluation Separation: In DDQN, the action is chosen using the evaluation network, but its value is estimated using the target network. This separation reduces the risk of overoptimistic value estimates, a problem common in the standard DQN.
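The separation described in the last bullet can be made concrete with a small sketch. The fragment below is illustrative NumPy code under assumed names (not the notebook's own implementation): q_online_next and q_target_next stand for batched next-state Q-values under the online and target networks.

```python
import numpy as np

def dqn_target(r, q_target_next, gamma, not_done):
    # Standard DQN: one network both selects and evaluates the next action,
    # so taking the max over noisy estimates tends to overestimate values.
    return r + gamma * not_done * q_target_next.max(axis=1)

def ddqn_target(r, q_online_next, q_target_next, gamma, not_done):
    # DDQN: the online network picks the action, the target network scores it.
    best_a = q_online_next.argmax(axis=1)
    return r + gamma * not_done * q_target_next[np.arange(len(r)), best_a]
```

Since the target network's value of the online network's chosen action can never exceed the target network's row maximum, the DDQN target is never larger than the DQN target, which is exactly how it curbs overestimation.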
WHAT ARE THE ADVANTAGES OF DDQN?
The Double Deep-Q Network (DDQN) offers significant advantages over the traditional Deep-Q Network (DQN) in terms of learning accuracy and stability. By separating action selection and value estimation between two neural networks, DDQN effectively reduces the overestimation bias common in DQNs. This separation ensures more reliable and stable learning outcomes. Additionally, the strategy of using delayed updates for the target network contributes to the overall stability of the learning process. Furthermore, DDQN typically exhibits enhanced performance, especially in environments characterized by noisy or misleading reward signals, demonstrating its superiority in complex learning scenarios.
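One detail worth isolating is the soft (Polyak) target update used for the delayed target-network tracking mentioned above. A minimal sketch on plain NumPy arrays, mirroring the per-parameter blending loop the agent performs on torch tensors (names here are illustrative):

```python
import numpy as np

def soft_update(target_w, online_w, tau=0.01):
    # Blend a small fraction tau of the online weights into the target weights;
    # the target network therefore tracks the online network slowly and smoothly.
    return tau * online_w + (1.0 - tau) * target_w
```

With tau = 0.01 the target moves only 1% of the way toward the online weights per update, which keeps the bootstrapped targets stable.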
SETTING UP THE MODEL ARCHITECTURE FOR THE DDQN MODEL
- The parameters used in the DDQN architecture follow the improved DQN model's parameters, as those were shown to return better rewards. DDQN is used in the hope that stabilizing the Q-values can further improve the reward obtained.
Below are the sections changed to suit DDQN's architecture:
- Action Selection Process : Actions during the agent's decision-making process in the choose_action function are selected using the target network self.Q_target.
- Target Q-Value Calculation : DDQN uses the main network to select the next action but the target network to estimate that action's value, the reverse of the roles in the DQN architecture.
# Defining the QNetwork class for the DDQN Agent
class QNetwork(nn.Module):
def __init__(self, state_dim, action_dim, q_lr):
super(QNetwork, self).__init__()
self.fc_1 = nn.Linear(state_dim, 64)
self.fc_2 = nn.Linear(64, 32)
self.fc_3 = nn.Linear(32, 16)
self.fc_out = nn.Linear(16, action_dim)
self.lr = q_lr
self.optimizer = optim.Adam(self.parameters(), lr=self.lr)
def forward(self, x):
q = F.leaky_relu(self.fc_1(x))
q = F.leaky_relu(self.fc_2(q))
q = F.leaky_relu(self.fc_3(q))
q = self.fc_out(q)
return q
# Creating a class for the DDQN Agent
class DDQNAgent:
def __init__(self):
self.state_dim = 3
self.action_dim = 15
self.lr = 0.001
self.gamma = 0.98
self.tau = 0.01
self.epsilon = 1.5
self.epsilon_decay = 0.98
self.epsilon_min = 0.001
self.buffer_size = 100000
self.batch_size = 200
self.memory = ReplayBuffer(self.buffer_size)
self.Q = QNetwork(self.state_dim, self.action_dim, self.lr)
self.Q_target = QNetwork(self.state_dim, self.action_dim, self.lr)
self.Q_target.load_state_dict(self.Q.state_dict())
def choose_action(self, state):
random_number = np.random.rand()
maxQ_action_count = 0
if self.epsilon < random_number:
with torch.no_grad():
# Use Q_target for action selection
action = float(torch.argmax(self.Q_target(state)).numpy())
real_action = (action - 4) / 4
maxQ_action_count = 1
else:
action = np.random.choice([n for n in range(9)])
real_action = (action - 4) / 2
return action, real_action, maxQ_action_count
def calc_target(self, mini_batch):
s, a, r, s_prime, done = mini_batch
with torch.no_grad():
# Use Q for action selection
best_next_action = torch.argmax(self.Q(s_prime), dim=1, keepdim=True)
q_target = self.Q_target(s_prime).gather(1, best_next_action)
target = r + self.gamma * done * q_target
return target
def train_agent(self):
mini_batch = self.memory.sample(self.batch_size)
s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
a_batch = a_batch.type(torch.int64)
td_target = self.calc_target(mini_batch)
# QNetwork training
Q_a = self.Q(s_batch).gather(1, a_batch)
q_loss = F.smooth_l1_loss(Q_a, td_target)
self.Q.optimizer.zero_grad()
q_loss.mean().backward()
self.Q.optimizer.step()
# QNetwork Soft Update for DDQN
for param_target, param in zip(self.Q_target.parameters(), self.Q.parameters()):
param_target.data.copy_(self.tau * param.data + (1.0 - self.tau) * param_target.data)
def train_DDQNAgent():
# Initialize the DDQN Agent and related variables required
agent = DDQNAgent()
env = gym.make('Pendulum-v1', g=9.81)
episodes = 800
total_rewards = []
success_count = 0
no_of_steps = []
frames = []
best_episode = 0
best_reward = float('-inf')
# Loop through the range of episodes
for episode in range(episodes):
state = env.reset()
score, done = 0.0, False
maxQ_action_count = 0
start_time = datetime.datetime.now()
while not done:
action, real_action, count = agent.choose_action(torch.FloatTensor(state))
state_prime, reward, done, _ = env.step([real_action])
agent.memory.put((state, action, reward, state_prime, done))
score += reward
maxQ_action_count += count
state = state_prime
if maxQ_action_count % 100 == 0 and score > -50:
screen = env.render(mode='rgb_array')
frames.append(screen)
if agent.memory.size() > 1000:
agent.train_agent()
# Recording results
if len(total_rewards) > 0:
success_count += (score - total_rewards[-1]) >= 200
total_rewards.append(score)
no_of_steps.append(maxQ_action_count)
if score > best_reward:
best_reward = score
best_episode = episode
# Saving the Models
save_folder = "DDQN"
if not os.path.exists(save_folder):
os.makedirs(save_folder)
if episode == best_episode:
model_name = os.path.join(save_folder, "DDQN" + str(episode) + ".pt")
torch.save(agent.Q.state_dict(), model_name)
if episode % 10 == 0:
elapsed_time = datetime.datetime.now() - start_time
print('Episode {:>4} | Total Reward: {:>8.2f} | MaxQ_Action_Count:{:>5} | Epsilon: {:>4.4f} | Elapsed: {}'.format(episode, score, maxQ_action_count, agent.epsilon, elapsed_time))
if agent.epsilon > agent.epsilon_min:
agent.epsilon *= agent.epsilon_decay
env.close()
return {
'total_rewards': total_rewards,
'no_of_steps': no_of_steps,
'success_count': success_count,
'frames': frames
}
DDQN_results = train_DDQNAgent()
Episode 0 | Total Reward: -1080.22 | MaxQ_Action_Count: 0 | Epsilon: 1.5000 | Elapsed: 0:00:00.492155
Episode 10 | Total Reward: -784.74 | MaxQ_Action_Count: 0 | Epsilon: 1.2256 | Elapsed: 0:00:00.857798
Episode 20 | Total Reward: -1393.62 | MaxQ_Action_Count: 0 | Epsilon: 1.0014 | Elapsed: 0:00:00.723278
Episode 30 | Total Reward: -972.29 | MaxQ_Action_Count: 32 | Epsilon: 0.8182 | Elapsed: 0:00:00.691218
Episode 40 | Total Reward: -1041.89 | MaxQ_Action_Count: 64 | Epsilon: 0.6686 | Elapsed: 0:00:00.558482
Episode 50 | Total Reward: -628.97 | MaxQ_Action_Count: 80 | Epsilon: 0.5463 | Elapsed: 0:00:00.693278
Episode 60 | Total Reward: -508.24 | MaxQ_Action_Count: 114 | Epsilon: 0.4463 | Elapsed: 0:00:00.717819
Episode 70 | Total Reward: -243.42 | MaxQ_Action_Count: 133 | Epsilon: 0.3647 | Elapsed: 0:00:00.712669
Episode 80 | Total Reward: -122.18 | MaxQ_Action_Count: 144 | Epsilon: 0.2980 | Elapsed: 0:00:00.772529
Episode 90 | Total Reward: -237.63 | MaxQ_Action_Count: 153 | Epsilon: 0.2435 | Elapsed: 0:00:00.689303
Episode 100 | Total Reward: -269.47 | MaxQ_Action_Count: 157 | Epsilon: 0.1989 | Elapsed: 0:00:00.661259
Episode 110 | Total Reward: -360.77 | MaxQ_Action_Count: 175 | Epsilon: 0.1625 | Elapsed: 0:00:00.783340
Episode 120 | Total Reward: -121.18 | MaxQ_Action_Count: 179 | Epsilon: 0.1328 | Elapsed: 0:00:00.762387
Episode 130 | Total Reward: -125.81 | MaxQ_Action_Count: 181 | Epsilon: 0.1085 | Elapsed: 0:00:00.647560
Episode 140 | Total Reward: -233.61 | MaxQ_Action_Count: 181 | Epsilon: 0.0887 | Elapsed: 0:00:00.727607
Episode 150 | Total Reward: -122.27 | MaxQ_Action_Count: 182 | Epsilon: 0.0724 | Elapsed: 0:00:00.534581
Episode 160 | Total Reward: -122.22 | MaxQ_Action_Count: 189 | Epsilon: 0.0592 | Elapsed: 0:00:00.583074
Episode 170 | Total Reward: -123.95 | MaxQ_Action_Count: 191 | Epsilon: 0.0484 | Elapsed: 0:00:00.576898
Episode 180 | Total Reward: -124.39 | MaxQ_Action_Count: 189 | Epsilon: 0.0395 | Elapsed: 0:00:00.561412
Episode 190 | Total Reward: -382.09 | MaxQ_Action_Count: 194 | Epsilon: 0.0323 | Elapsed: 0:00:00.572251
Episode 200 | Total Reward: -121.65 | MaxQ_Action_Count: 195 | Epsilon: 0.0264 | Elapsed: 0:00:00.587561
Episode 210 | Total Reward: -120.60 | MaxQ_Action_Count: 198 | Epsilon: 0.0216 | Elapsed: 0:00:00.575041
Episode 220 | Total Reward: -114.98 | MaxQ_Action_Count: 199 | Epsilon: 0.0176 | Elapsed: 0:00:00.683037
Episode 230 | Total Reward: -122.13 | MaxQ_Action_Count: 200 | Epsilon: 0.0144 | Elapsed: 0:00:00.763769
Episode 240 | Total Reward: -246.70 | MaxQ_Action_Count: 199 | Epsilon: 0.0118 | Elapsed: 0:00:00.580312
Episode 250 | Total Reward: -231.83 | MaxQ_Action_Count: 199 | Epsilon: 0.0096 | Elapsed: 0:00:01.545827
Episode 260 | Total Reward: -234.21 | MaxQ_Action_Count: 199 | Epsilon: 0.0079 | Elapsed: 0:00:00.588389
Episode 270 | Total Reward: -378.82 | MaxQ_Action_Count: 198 | Epsilon: 0.0064 | Elapsed: 0:00:00.604841
Episode 280 | Total Reward: -119.93 | MaxQ_Action_Count: 200 | Epsilon: 0.0052 | Elapsed: 0:00:00.645680
Episode 290 | Total Reward: -125.49 | MaxQ_Action_Count: 199 | Epsilon: 0.0043 | Elapsed: 0:00:00.610931
Episode 300 | Total Reward: -243.49 | MaxQ_Action_Count: 199 | Epsilon: 0.0035 | Elapsed: 0:00:00.599383
Episode 310 | Total Reward: -242.24 | MaxQ_Action_Count: 200 | Epsilon: 0.0029 | Elapsed: 0:00:00.677754
Episode 320 | Total Reward: -355.14 | MaxQ_Action_Count: 198 | Epsilon: 0.0023 | Elapsed: 0:00:00.603506
Episode 330 | Total Reward: -120.02 | MaxQ_Action_Count: 200 | Epsilon: 0.0019 | Elapsed: 0:00:00.639646
Episode 340 | Total Reward: -123.97 | MaxQ_Action_Count: 200 | Epsilon: 0.0016 | Elapsed: 0:00:00.577731
Episode 350 | Total Reward: -1.11 | MaxQ_Action_Count: 199 | Epsilon: 0.0013 | Elapsed: 0:00:00.622338
Episode 360 | Total Reward: -245.18 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.641010
Episode 370 | Total Reward: -233.93 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.608609
Episode 380 | Total Reward: -125.98 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.616386
Episode 390 | Total Reward: -248.56 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.637242
Episode 400 | Total Reward: -2.12 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.649135
Episode 410 | Total Reward: -244.65 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.613399
Episode 420 | Total Reward: -127.14 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.620103
Episode 430 | Total Reward: -1.47 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.671880
Episode 440 | Total Reward: -115.95 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.691710
Episode 450 | Total Reward: -124.82 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.657961
Episode 460 | Total Reward: -126.57 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.621857
Episode 470 | Total Reward: -252.93 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.626950
Episode 480 | Total Reward: -1.16 | MaxQ_Action_Count: 198 | Epsilon: 0.0010 | Elapsed: 0:00:00.706426
Episode 490 | Total Reward: -117.87 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.881479
Episode 500 | Total Reward: -246.50 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.857655
Episode 510 | Total Reward: -128.50 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.881455
Episode 520 | Total Reward: -127.42 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.940179
Episode 530 | Total Reward: -123.36 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.805455
Episode 540 | Total Reward: -124.35 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.808052
Episode 550 | Total Reward: -119.67 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.763212
Episode 560 | Total Reward: -128.98 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.776317
Episode 570 | Total Reward: -248.52 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.783325
Episode 580 | Total Reward: -123.06 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.851112
Episode 590 | Total Reward: -120.66 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.885875
Episode 600 | Total Reward: -235.40 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.808629
Episode 610 | Total Reward: -1.60 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.828082
Episode 620 | Total Reward: -2.54 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.772766
Episode 630 | Total Reward: -238.89 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.802408
Episode 640 | Total Reward: -121.04 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.853356
Episode 650 | Total Reward: -126.19 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.837425
Episode 660 | Total Reward: -124.32 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.766248
Episode 670 | Total Reward: -124.67 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.783132
Episode 680 | Total Reward: -0.32 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.706073
Episode 690 | Total Reward: -360.14 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.624401
Episode 700 | Total Reward: -124.49 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.622957
Episode 710 | Total Reward: -0.72 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.673523
Episode 720 | Total Reward: -362.04 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.629440
Episode 730 | Total Reward: -383.93 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.625617
Episode 740 | Total Reward: -231.58 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.659562
Episode 750 | Total Reward: -364.26 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.627990
Episode 760 | Total Reward: -124.43 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.653057
Episode 770 | Total Reward: -118.20 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.639233
Episode 780 | Total Reward: -352.28 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.665880
Episode 790 | Total Reward: -1.09 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.697581
VISUALIZING THE PERFORMANCE OF THE DOUBLE DQN MODEL
- Our Double-DQN model's performance is roughly on par with the Improved DQN model, except that it achieved a slightly better average reward of -261.91 compared to the Improved DQN model's -265.13.
- The slightly improved performance of the Double-DQN model can likely be attributed to its ability to address the overestimation bias present in traditional DQN algorithms.
- One reason for the similar performance could be that the Pendulum environment is relatively simple, with small state and action spaces, so the benefits of advanced techniques like Double-DQN may not be as pronounced as in more complex tasks.
# Calculating statistical measures
average_reward = np.mean(DDQN_results['total_rewards'])
median_reward = np.median(DDQN_results['total_rewards'])
max_reward = np.max(DDQN_results['total_rewards'])
min_reward = np.min(DDQN_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(DDQN_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the Double DQN Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(DDQN_results['total_rewards'], average_reward, model_name="Double DQN")
Performance Statistics for the Double DQN Model:
--------------------------------------------
Best Episode : 776
Average Reward : -261.91
Median Reward : -129.09
Maximum Reward : -0.26
Minimum Reward : -1756.07
VIEWING THE MODEL ARCHITECTURE AND PENDULUM ANIMATION
- Now, we will look at the model architecture used to train the Double DQN agent, using the .eval() function from PyTorch.
- We will also view the animation of the pendulum's movement to visualize how the pendulum behaves.
# Load and view the model's architecture used for DDQN
trained_model = DDQNAgent()
trained_model.Q.load_state_dict(torch.load("DDQN/DDQN776.pt"))
trained_model.Q.eval()
QNetwork(
  (fc_1): Linear(in_features=3, out_features=64, bias=True)
  (fc_2): Linear(in_features=64, out_features=32, bias=True)
  (fc_3): Linear(in_features=32, out_features=16, bias=True)
  (fc_out): Linear(in_features=16, out_features=15, bias=True)
)
TESTING OUR MODEL WEIGHTS
- There is no training involved.
- The aim is to see whether the saved model weights can keep the pendulum inverted.
class DDQNTestAgent:
def __init__(self, weight_file_path):
self.state_dim = 3
self.action_dim = 15
self.lr = 0.001
self.trained_model = weight_file_path
self.Q = QNetwork(self.state_dim, self.action_dim, self.lr)
self.Q.load_state_dict(torch.load(self.trained_model))
def choose_action(self, state):
with torch.no_grad():
action = float(torch.argmax(self.Q(state)).numpy())
real_action = (action - 4) / 2
return real_action
agent = DDQNTestAgent("DDQN/DDQN776.pt")
test_agent(agent, 'DDQN')
Test reward: -367.44327135316513
MODEL TRAINING EVOLUTION
- Visualize how the model has improved over each episode
# Visualizing the pendulum's animation
create_animation(DDQN_results['frames'])
MODEL 5 : SOFT ACTOR-CRITIC NETWORK (SAC)¶
The Soft Actor-Critic Network is an agent that employs a stochastic policy for action selection, enabling it to capture the inherent uncertainty in many real-world environments. This stochasticity helps SAC to explore better and handle environments with continuous action spaces, which is suitable in the case of the Pendulum task.
SAC introduces an entropy term into the objective function. This term encourages the policy to take actions that are not only rewarding but also diverse, which prevents premature convergence to suboptimal policies and aids exploration. At the same time, SAC uses a soft value function, allowing it to handle both continuous and discrete action spaces seamlessly.
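The entropy-regularized target can be sketched as follows. This is an illustrative NumPy fragment with assumed names (not the notebook's code), mirroring the structure of the calc_target method the SAC agent implements below: the twin critics' minimum plus an entropy bonus of -alpha * log_prob.

```python
import numpy as np

def sac_target(r, q1_next, q2_next, log_prob_next, alpha, gamma, not_done):
    # Twin critics: take the elementwise minimum to curb overestimation
    q_next = np.minimum(q1_next, q2_next)
    # Entropy bonus: -alpha * log_prob rewards more stochastic (diverse) actions
    entropy = -alpha * log_prob_next
    return r + gamma * not_done * (q_next + entropy)
```

Setting alpha to 0 recovers a plain clipped double-Q target; larger alpha values push the policy toward more exploratory behaviour.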
WHAT ARE THE ADVANTAGES OF SAC?
Stochastic Policies: SAC's use of stochastic policies allows for better exploration, especially in environments with continuous action spaces, where deterministic policies may struggle.
Entropy Regularization: The inclusion of an entropy regularization term encourages diverse actions and robust exploration, preventing the algorithm from getting stuck in suboptimal solutions.
Sample Efficiency: Being an off-policy algorithm, SAC can make more efficient use of past experiences, reducing the need for extensive interaction with the environment.
Versatility: SAC can handle both continuous and discrete action spaces, making it suitable for a wide range of reinforcement learning tasks.
Actor-Critic Separation: Separating the actor and critic networks reduces overestimation bias and contributes to more stable learning.
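The sample-efficiency point rests on the replay buffer shared by all the agents in this notebook (the actual ReplayBuffer class is defined in an earlier section). A minimal illustrative version, showing only the off-policy idea of storing transitions and sampling uncorrelated mini-batches:

```python
import collections
import random

# Illustrative stand-in for the notebook's ReplayBuffer, not its actual code.
class MiniReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen silently evicts the oldest transitions when full
        self.buffer = collections.deque(maxlen=capacity)

    def put(self, transition):
        # store (state, action, reward, next_state, done); each transition
        # can be reused across many gradient updates
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of episodes
        return random.sample(self.buffer, batch_size)

    def size(self):
        return len(self.buffer)
```

Sampling uniformly from old experience is what lets an off-policy method like SAC learn from far fewer environment interactions than a purely on-policy one.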
SETTING UP THE MODEL ARCHITECTURE FOR THE SAC MODEL
# Defining the PolicyNetwork class for the SAC Agent
class PolicyNetwork(nn.Module):
def __init__(self, state_dim, action_dim, actor_lr):
super(PolicyNetwork, self).__init__()
self.fc_1 = nn.Linear(state_dim, 64)
self.fc_2 = nn.Linear(64, 64)
self.fc_mu = nn.Linear(64, action_dim)
self.fc_std = nn.Linear(64, action_dim)
self.lr = actor_lr
self.LOG_STD_MIN = -20
self.LOG_STD_MAX = 2
self.max_action = 2
self.min_action = -2
self.action_scale = (self.max_action - self.min_action) / 2.0
self.action_bias = (self.max_action + self.min_action) / 2.0
self.optimizer = optim.Adam(self.parameters(), lr=self.lr)
def forward(self, x):
x = F.leaky_relu(self.fc_1(x))
x = F.leaky_relu(self.fc_2(x))
mu = self.fc_mu(x)
log_std = self.fc_std(x)
log_std = torch.clamp(log_std, self.LOG_STD_MIN, self.LOG_STD_MAX)
return mu, log_std
def sample(self, state):
mean, log_std = self.forward(state)
std = torch.exp(log_std)
reparameter = Normal(mean, std)
x_t = reparameter.rsample()
y_t = torch.tanh(x_t)
action = self.action_scale * y_t + self.action_bias
# Enforcing Action Bound
log_prob = reparameter.log_prob(x_t)
log_prob = log_prob - torch.sum(torch.log(self.action_scale * (1 - y_t.pow(2)) + 1e-6), dim=-1, keepdim=True)
return action, log_prob
# Defining the QNetwork class for the SAC Agent
class QNetwork(nn.Module):
def __init__(self, state_dim, action_dim, critic_lr):
super(QNetwork, self).__init__()
self.fc_s = nn.Linear(state_dim, 32)
self.fc_a = nn.Linear(action_dim, 32)
self.fc_1 = nn.Linear(64, 64)
self.fc_out = nn.Linear(64, action_dim)
self.lr = critic_lr
self.optimizer = optim.Adam(self.parameters(), lr=self.lr)
def forward(self, x, a):
h1 = F.leaky_relu(self.fc_s(x))
h2 = F.leaky_relu(self.fc_a(a))
cat = torch.cat([h1, h2], dim=-1)
q = F.leaky_relu(self.fc_1(cat))
q = self.fc_out(q)
return q
# Creating and defining the SAC Agent
class SACAgent:
def __init__(self):
self.state_dim = 3
self.action_dim = 1
self.lr_pi = 0.001
self.lr_q = 0.001
self.gamma = 0.98
self.batch_size = 200
self.buffer_limit = 100000
self.tau = 0.005
self.init_alpha = 0.01
self.target_entropy = -self.action_dim
self.lr_alpha = 0.005
self.memory = ReplayBuffer(self.buffer_limit)
self.log_alpha = torch.tensor(np.log(self.init_alpha))
self.log_alpha.requires_grad = True
self.log_alpha_optimizer = optim.Adam([self.log_alpha], lr=self.lr_alpha)
self.PI = PolicyNetwork(self.state_dim, self.action_dim, self.lr_pi)
self.Q1 = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q1_target = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q2 = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q2_target = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q1_target.load_state_dict(self.Q1.state_dict())
self.Q2_target.load_state_dict(self.Q2.state_dict())
def choose_action(self, s):
with torch.no_grad():
action, log_prob = self.PI.sample(s)
return action, log_prob
def calc_target(self, mini_batch):
s, a, r, s_prime, done = mini_batch
with torch.no_grad():
a_prime, log_prob_prime = self.PI.sample(s_prime)
entropy = - self.log_alpha.exp() * log_prob_prime
q1_target, q2_target = self.Q1_target(s_prime, a_prime), self.Q2_target(s_prime, a_prime)
q_target = torch.min(q1_target, q2_target)
target = r + self.gamma * done * (q_target + entropy)
return target
def train_agent(self):
mini_batch = self.memory.sample(self.batch_size)
s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
td_target = self.calc_target(mini_batch)
# Training of Q1
q1_loss = F.smooth_l1_loss(self.Q1(s_batch, a_batch), td_target)
self.Q1.optimizer.zero_grad()
q1_loss.mean().backward()
self.Q1.optimizer.step()
# Training of Q2
q2_loss = F.smooth_l1_loss(self.Q2(s_batch, a_batch), td_target)
self.Q2.optimizer.zero_grad()
q2_loss.mean().backward()
self.Q2.optimizer.step()
# Training of PI
a, log_prob = self.PI.sample(s_batch)
entropy = -self.log_alpha.exp() * log_prob
q1, q2 = self.Q1(s_batch, a), self.Q2(s_batch, a)
q = torch.min(q1, q2)
pi_loss = -(q + entropy) # For gradient ascent
self.PI.optimizer.zero_grad()
pi_loss.mean().backward()
self.PI.optimizer.step()
# Alpha train
self.log_alpha_optimizer.zero_grad()
alpha_loss = -(self.log_alpha.exp() * (log_prob + self.target_entropy).detach()).mean()
alpha_loss.backward()
self.log_alpha_optimizer.step()
# Soft update of Q1 and Q2
for param_target, param in zip(self.Q1_target.parameters(), self.Q1.parameters()):
param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
for param_target, param in zip(self.Q2_target.parameters(), self.Q2.parameters()):
param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
def train_SACAgent():
# Initialize the SAC Agent and related variables required
agent = SACAgent()
env = gym.make('Pendulum-v1', g=9.81)
episodes = 800
total_rewards = []
no_of_steps = []
success_count = 0
frames = []
best_episode = 0
best_reward = float('-inf')
# Loop through the range of episodes
for episode in range(episodes):
state = env.reset()
score, done = 0.0, False
start_time = datetime.datetime.now()
counter = 0
while not done:
counter += 1
action, log_prob = agent.choose_action(torch.FloatTensor(state))
state_prime, reward, done, _ = env.step([action])
agent.memory.put((state, action, reward, state_prime, done))
score += reward
state = state_prime
if counter % 50 == 0 and score > -50:
screen = env.render(mode='rgb_array')
frames.append(screen)
if agent.memory.size() > 1000:
agent.train_agent()
# Recording results
if len(total_rewards) > 0:
success_count += (score - total_rewards[-1]) >= 200
total_rewards.append(score)
no_of_steps.append(counter)
if score > best_reward:
best_reward = score
best_episode = episode
# Saving the Models
save_folder = "SAC"
if not os.path.exists(save_folder):
os.makedirs(save_folder)
if episode == best_episode:
model_name = os.path.join(save_folder, "SAC" + str(episode) + ".pt")
torch.save(agent.PI.state_dict(), model_name)
if episode % 10 == 0:
elapsed_time = datetime.datetime.now() - start_time
print('Episode {:>4} | Total Reward: {:>8.2f} | Elapsed: {}'.format(episode, score, elapsed_time))
env.close()
return {
'total_rewards': total_rewards,
'no_of_steps': no_of_steps,
'success_count': success_count,
'frames': frames
}
SAC_results = train_SACAgent()
Episode 0 | Total Reward: -1263.81 | Elapsed: 0:00:00.091711
Episode 10 | Total Reward: -1489.46 | Elapsed: 0:00:02.265242
Episode 20 | Total Reward: -389.96 | Elapsed: 0:00:02.632590
Episode 30 | Total Reward: -127.36 | Elapsed: 0:00:02.051764
Episode 40 | Total Reward: -131.92 | Elapsed: 0:00:02.080935
Episode 50 | Total Reward: -368.68 | Elapsed: 0:00:01.994790
Episode 60 | Total Reward: -125.87 | Elapsed: 0:00:02.099870
Episode 70 | Total Reward: -495.16 | Elapsed: 0:00:02.093062
Episode 80 | Total Reward: -3.40 | Elapsed: 0:00:02.169272
Episode 90 | Total Reward: -0.80 | Elapsed: 0:00:02.194396
Episode 100 | Total Reward: -373.32 | Elapsed: 0:00:02.114902
Episode 110 | Total Reward: -115.08 | Elapsed: 0:00:02.178001
Episode 120 | Total Reward: -126.53 | Elapsed: 0:00:02.228045
Episode 130 | Total Reward: -252.04 | Elapsed: 0:00:02.115674
Episode 140 | Total Reward: -248.56 | Elapsed: 0:00:02.073931
Episode 150 | Total Reward: -130.85 | Elapsed: 0:00:02.063166
Episode 160 | Total Reward: -3.86 | Elapsed: 0:00:02.060495
Episode 170 | Total Reward: -5.87 | Elapsed: 0:00:01.966251
Episode 180 | Total Reward: -378.04 | Elapsed: 0:00:02.164835
Episode 190 | Total Reward: -133.99 | Elapsed: 0:00:02.046350
Episode 200 | Total Reward: -322.09 | Elapsed: 0:00:02.146470
Episode 210 | Total Reward: -131.22 | Elapsed: 0:00:02.798415
Episode 220 | Total Reward: -130.28 | Elapsed: 0:00:02.116329
Episode 230 | Total Reward: -122.91 | Elapsed: 0:00:02.080042
Episode 240 | Total Reward: -126.82 | Elapsed: 0:00:02.457065
Episode 250 | Total Reward: -241.15 | Elapsed: 0:00:02.072398
Episode 260 | Total Reward: -135.84 | Elapsed: 0:00:02.127494
Episode 270 | Total Reward: -128.68 | Elapsed: 0:00:02.092727
Episode 280 | Total Reward: -132.87 | Elapsed: 0:00:01.963818
Episode 290 | Total Reward: -253.82 | Elapsed: 0:00:01.983134
Episode 300 | Total Reward: -5.20 | Elapsed: 0:00:02.007278
Episode 310 | Total Reward: -244.54 | Elapsed: 0:00:02.125452
Episode 320 | Total Reward: -133.17 | Elapsed: 0:00:02.101796
Episode 330 | Total Reward: -252.38 | Elapsed: 0:00:02.108584
Episode 340 | Total Reward: -251.01 | Elapsed: 0:00:02.068382
Episode 350 | Total Reward: -241.99 | Elapsed: 0:00:02.210062
Episode 360 | Total Reward: -246.01 | Elapsed: 0:00:02.027896
Episode 370 | Total Reward: -253.02 | Elapsed: 0:00:01.965153
Episode 380 | Total Reward: -130.73 | Elapsed: 0:00:01.942947
Episode 390 | Total Reward: -131.26 | Elapsed: 0:00:02.073364
Episode 400 | Total Reward: -246.84 | Elapsed: 0:00:02.130810
Episode 410 | Total Reward: -345.02 | Elapsed: 0:00:02.230273
Episode 420 | Total Reward: -0.68 | Elapsed: 0:00:02.209572
Episode 430 | Total Reward: -228.04 | Elapsed: 0:00:02.294913
Episode 440 | Total Reward: -131.99 | Elapsed: 0:00:02.166283
Episode 450 | Total Reward: -130.42 | Elapsed: 0:00:02.070453
Episode 460 | Total Reward: -246.70 | Elapsed: 0:00:02.073242
Episode 470 | Total Reward: -233.23 | Elapsed: 0:00:02.109055
Episode 480 | Total Reward: -130.46 | Elapsed: 0:00:02.173595
Episode 490 | Total Reward: -122.46 | Elapsed: 0:00:02.161290
Episode 500 | Total Reward: -121.94 | Elapsed: 0:00:02.142550
Episode 510 | Total Reward: -231.43 | Elapsed: 0:00:02.151427
Episode 520 | Total Reward: -3.00 | Elapsed: 0:00:02.182354
Episode 530 | Total Reward: -132.74 | Elapsed: 0:00:02.005636
Episode 540 | Total Reward: -2.03 | Elapsed: 0:00:02.191058
Episode 550 | Total Reward: -2.99 | Elapsed: 0:00:02.204246
Episode 560 | Total Reward: -1.44 | Elapsed: 0:00:02.168411
Episode 570 | Total Reward: -132.42 | Elapsed: 0:00:02.168530
Episode 580 | Total Reward: -220.00 | Elapsed: 0:00:02.069063
Episode 590 | Total Reward: -126.75 | Elapsed: 0:00:02.156660
Episode 600 | Total Reward: -239.90 | Elapsed: 0:00:02.087662
Episode 610 | Total Reward: -134.35 | Elapsed: 0:00:02.060851
Episode 620 | Total Reward: -131.85 | Elapsed: 0:00:02.062815
Episode 630 | Total Reward: -5.70 | Elapsed: 0:00:02.174400
Episode 640 | Total Reward: -125.27 | Elapsed: 0:00:02.184379
Episode 650 | Total Reward: -242.39 | Elapsed: 0:00:02.187821
Episode 660 | Total Reward: -241.77 | Elapsed: 0:00:02.161412
Episode 670 | Total Reward: -128.78 | Elapsed: 0:00:02.088581
Episode 680 | Total Reward: -5.30 | Elapsed: 0:00:02.113722
Episode 690 | Total Reward: -132.38 | Elapsed: 0:00:02.080776
Episode 700 | Total Reward: -122.94 | Elapsed: 0:00:02.136434
Episode 710 | Total Reward: -129.84 | Elapsed: 0:00:02.132908
Episode 720 | Total Reward: -6.86 | Elapsed: 0:00:02.192942
Episode 730 | Total Reward: -126.04 | Elapsed: 0:00:02.191723
Episode 740 | Total Reward: -118.76 | Elapsed: 0:00:02.141104
Episode 750 | Total Reward: -246.72 | Elapsed: 0:00:02.325031
Episode 760 | Total Reward: -127.43 | Elapsed: 0:00:02.198647
Episode 770 | Total Reward: -121.37 | Elapsed: 0:00:02.181811
Episode 780 | Total Reward: -2.57 | Elapsed: 0:00:02.231232
Episode 790 | Total Reward: -243.67 | Elapsed: 0:00:02.121809
VISUALIZING THE PERFORMANCE FOR THE SOFT ACTOR-CRITIC MODEL
- Based on the results of the Soft Actor-Critic model, we find that it actually performed better than the DQN models tested earlier. Although the DQN models were successful in balancing the pendulum, the SAC agent achieved a much better average reward of -188.68.
- This shows that the Soft Actor-Critic model has an advantage in the Pendulum environment, likely because it is designed to handle continuous actions, whereas DQN typically deals with discrete action spaces and requires the action space to be discretized for continuous control tasks.
- Another reason could also be that SAC employs a stochastic policy, which allows for efficient exploration in continuous action spaces.
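To make the contrast concrete, here is a small illustrative sketch (not the project's code) of the two action-selection styles. Pendulum-v1's torque range is [-2, 2]; a DQN must pick from a fixed grid of torques, while SAC samples any value in the range from a tanh-squashed Gaussian. The bin count and the Gaussian parameters below are arbitrary choices for illustration.

```python
import numpy as np

# DQN-style: the continuous torque range [-2, 2] must be discretized into bins
def discretize_actions(low=-2.0, high=2.0, n_bins=5):
    return np.linspace(low, high, n_bins)

# SAC-style: sample a continuous action from a tanh-squashed Gaussian
def sample_continuous_action(mu, std, scale=2.0, rng=None):
    rng = rng or np.random.default_rng(0)
    raw = rng.normal(mu, std)     # stochastic Gaussian sample
    return scale * np.tanh(raw)   # squash into [-scale, scale]

bins = discretize_actions()
print(bins)                                 # only 5 torques available to the DQN
print(sample_continuous_action(0.0, 0.5))   # any torque in [-2, 2] for SAC
```

The stochastic sampling is also what gives SAC its built-in exploration: different actions are tried even for the same state, without needing an epsilon-greedy schedule.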
# Calculating statistical measures
average_reward = np.mean(SAC_results['total_rewards'])
median_reward = np.median(SAC_results['total_rewards'])
max_reward = np.max(SAC_results['total_rewards'])
min_reward = np.min(SAC_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(SAC_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the SAC Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(SAC_results['total_rewards'], average_reward, model_name="SAC")
Performance Statistics for the SAC Model: -------------------------------------------- Best Episode : 42 Average Reward : -188.68 Median Reward : -131.26 Maximum Reward : -0.14 Minimum Reward : -1850.31
VIEWING THE MODEL ARCHITECTURE AND PENDULUM ANIMATION
- Now, we will look at the model architecture used to train the SAC agent, using PyTorch's .eval() function.
- We will also view an animation of the pendulum's movement to visualize how the pendulum behaves.
# Load and view the model's architecture used for SAC
trained_model = SACAgent()
trained_model.PI.load_state_dict(torch.load("SAC/SAC42.pt"))
trained_model.PI.eval()
PolicyNetwork( (fc_1): Linear(in_features=3, out_features=64, bias=True) (fc_2): Linear(in_features=64, out_features=64, bias=True) (fc_mu): Linear(in_features=64, out_features=1, bias=True) (fc_std): Linear(in_features=64, out_features=1, bias=True) )
TESTING OUR MODEL WEIGHTS
- There is no training involved.
- The goal is to see if the saved model weights can keep the pendulum inverted.
Note that there is no need to create a new model, because the choose_action function of SAC does not rely on randomly generated epsilon values to encourage exploration.
test_agent(trained_model, 'SAC')
Test reward: -127.53441782760822
MODEL TRAINING EVOLUTION
- Visualize how the model has improved over each episode
# Visualizing the pendulum's animation
create_animation(SAC_results['frames'])
MODEL EVALUATION AND PERFORMANCE ANALYSIS¶
In this section, we will be performing an evaluation with 800 testing episodes for each model. For performance analysis and evaluation, we will be doing the following:
- Show the evaluation statistics for each model
- Visualize the average reward and success rate for each model
- Perform a two-sample independent t-test to determine statistical significance
- Evaluate each model's efficiency
PERFORMING CALCULATIONS
In this section, we will first perform the calculations necessary to evaluate the performance of each model. The following steps are carried out:
- Collect data from each model by running 800 testing episodes.
- Calculate the metrics (avg reward, std reward, avg steps, std steps, success rate) for each model.
class MetricsCalculator:
def __init__(self, total_rewards, no_of_steps, success_count, n_episodes, frames):
self.total_rewards = total_rewards
self.no_of_steps = no_of_steps
self.success_count = success_count
self.n_episodes = n_episodes
self.frames = frames
def avg_reward_per_episode(self):
sum_reward = np.sum(self.total_rewards)
return sum_reward / self.n_episodes
def std_reward_per_episode(self):
return np.std(self.total_rewards)
def avg_steps_taken(self):
step_count = np.sum(self.no_of_steps)
return step_count / self.n_episodes
def std_steps_taken(self):
return np.std(self.no_of_steps)
def avg_reward_per_step(self):
sum_reward = np.sum(self.total_rewards)
step_count = np.sum(self.no_of_steps)
return sum_reward / step_count
def success_rate(self):
return self.success_count / self.n_episodes
def render_frames(self):
create_animation(self.frames)
DQN_metrics = MetricsCalculator(**DQN_results, n_episodes=800)
ImprovedDQN_metrics = MetricsCalculator(**ImprovedDQN_results, n_episodes=800)
DDQN_metrics = MetricsCalculator(**DDQN_results, n_episodes=800)
SAC_metrics = MetricsCalculator(**SAC_results, n_episodes=800)
def create_dataframe_from_dict(data_dict, column_name=None):
df = pd.DataFrame.from_dict(data_dict, orient='index')
if column_name:
df.columns = [column_name]
return df
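The helper above simply turns a metrics dictionary into a one-column DataFrame indexed by model name. A toy run (with made-up numbers, not our actual results) illustrates the shape it produces:

```python
import pandas as pd

def create_dataframe_from_dict(data_dict, column_name=None):
    # Keys become the row index, values become the single column
    df = pd.DataFrame.from_dict(data_dict, orient='index')
    if column_name:
        df.columns = [column_name]
    return df

# Hypothetical metric values, purely for illustration
toy = {'DQN': -340.94, 'SAC': -176.18}
df = create_dataframe_from_dict(toy, 'Avg_Reward_Per_Episode')
print(df)
```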
PLOTTING THE REWARD BAR PLOT
- Here, we visualize the performance of various reinforcement learning models on the Pendulum task. The bar plot displays the average reward obtained by each model over the evaluation episodes.
- This helps us compare the effectiveness of the different algorithms and choose the best one for our application. It also highlights the trade-off between stability and exploration in reinforcement learning and gives us insight into the behaviour of these models.
# Dictionary of average reward per episode for each model
all_avg_reward_per_episode = {
'DQN': DQN_metrics.avg_reward_per_episode(),
'Improved DQN': ImprovedDQN_metrics.avg_reward_per_episode(),
'DDQN': DDQN_metrics.avg_reward_per_episode(),
'SAC': SAC_metrics.avg_reward_per_episode()
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_avg_reward_per_episode, 'Avg_Reward_Per_Episode')
df
| Model | Avg_Reward_Per_Episode |
|---|---|
| DQN | -340.944229 |
| Improved DQN | -545.086096 |
| DDQN | -569.351038 |
| SAC | -176.178789 |
SAC had the highest average reward per episode, indicating a consistent ability to achieve high rewards. Surprisingly, the baseline DQN performed better in this evaluation than the Improved DQN.
# Sort the DataFrame by 'Avg_Reward_Per_Episode' in ascending order
df = df.sort_values(by='Avg_Reward_Per_Episode', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle("Average Reward")
ax = fig.subplots()
sns.barplot(
data=df,
y='Avg_Reward_Per_Episode',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
# ax.legend()
ax.set_ylabel('Avg Reward Per Episode') # Swap x and y axis labels
ax.set_xlabel('Model') # Swap x and y axis labels
plt.show()
PLOTTING THE SUCCESS RATE OF THE MODELS
- Now, we will take a look at the success rates of the models to find out which model has the highest rate of succeeding at the Pendulum task.
# Dictionary of success rates for each model
all_success_rate = {
'DQN': DQN_metrics.success_rate(),
'Improved DQN': ImprovedDQN_metrics.success_rate(),
'DDQN': DDQN_metrics.success_rate(),
'SAC': SAC_metrics.success_rate()
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_success_rate, 'success_rate')
df
| Model | success_rate |
|---|---|
| DQN | 0.2450 |
| Improved DQN | 0.1525 |
| DDQN | 0.1575 |
| SAC | 0.0725 |
Success rate here means "how often does a model improve substantially on its previous episode's result".
DQN had the highest success rate because its training was quite irregular: whenever it performed badly, it was able to correct itself quickly in the next episode. This reflects its difficulty adapting to the continuous environment, as it applies the same policy, fails for a particular episode, but then learns from that failure and improves in the very next episode.
SAC was the lowest in this evaluation because it achieved success and stable high rewards very early on, which left it with fewer opportunities to "bounce back" from unfavourable episodes.
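The success criterion used in our training loops counts an episode as a success when its score improves on the previous episode by at least 200 (the `(score - total_rewards[-1]) >= 200` check in the tuning function). A small sketch on a made-up reward trace shows how this favours "bouncy" learners over already-stable ones:

```python
import numpy as np

def success_rate(total_rewards, threshold=200):
    """Fraction of episodes that improve on the previous one by >= threshold."""
    rewards = np.asarray(total_rewards, dtype=float)
    improvements = rewards[1:] - rewards[:-1]          # episode-to-episode change
    return np.sum(improvements >= threshold) / len(rewards)

# Toy reward trace with three big "bounce-backs" out of six episodes
toy = [-900, -650, -640, -300, -310, -100]
print(success_rate(toy))  # 3/6 = 0.5
```

An agent that converges early to a flat, high reward curve never produces large episode-to-episode jumps, so its success rate under this definition stays low even though it performs well.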
# Sort the DataFrame by 'success_rate' in ascending order
df = df.sort_values(by='success_rate', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle("Success Rate")
ax = fig.subplots()
sns.barplot(
data=df,
y='success_rate',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
ax.set_ylabel('Success rate')
ax.set_xlabel('Model')
plt.show()
TWO SAMPLE INDEPENDENT T-TEST
Next, we perform a two-sample independent t-test between each pair of models to determine whether the differences in their results are statistically significant. Although the mean of one model may be higher than another's, if the standard deviations are large enough, the apparent difference in means may simply be due to randomness.
Likewise, the means of two models may appear very similar, yet be significantly different if the standard deviations are small. Although the large number of episodes reduces the chance of being misled visually, it is much better to perform a statistical test.
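To see both effects on toy numbers (made up for illustration, not our results), `scipy.stats.ttest_ind_from_stats` lets us run the test directly from summary statistics:

```python
from scipy import stats

# Two models whose means differ by only 5, but with tight standard deviations:
_, p_tight = stats.ttest_ind_from_stats(mean1=-130.0, std1=10.0, nobs1=800,
                                        mean2=-125.0, std2=10.0, nobs2=800)
print(p_tight < 0.05)  # True: even a small gap is significant when spread is small

# The same mean gap with large standard deviations is not significant:
_, p_wide = stats.ttest_ind_from_stats(mean1=-130.0, std1=300.0, nobs1=800,
                                       mean2=-125.0, std2=300.0, nobs2=800)
print(p_wide > 0.05)   # True
```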
Null Hypothesis (H0): Average results from the different models are identical
Alternate Hypothesis (H1): Average results from the different models are not identical
At a 95% confidence level, the test results show that all model pairs but one are significantly different (H0 is rejected), as indicated by the very small p-values. The exception is Improved DQN vs DDQN, which had a p-value of 0.09 (> 0.05), so H0 cannot be rejected for that pair.
import numpy as np
from scipy import stats
def two_sample_t_test(mean1, std1, mean2, std2, n1, n2):
t, p = stats.ttest_ind_from_stats(mean1, std1, n1, mean2, std2, n2)
return p
# Define the model metrics
dqn_avg = DQN_metrics.avg_reward_per_episode()
dqn_std = DQN_metrics.std_reward_per_episode()
improved_dqn_avg = ImprovedDQN_metrics.avg_reward_per_episode()
improved_dqn_std = ImprovedDQN_metrics.std_reward_per_episode()
ddqn_avg = DDQN_metrics.avg_reward_per_episode()
ddqn_std = DDQN_metrics.std_reward_per_episode()
sac_avg = SAC_metrics.avg_reward_per_episode()
sac_std = SAC_metrics.std_reward_per_episode()
# Sample sizes
n1 = 800
n2 = 800
# Perform two-sample t-tests between every pair of models and print the results
model_stats = {
    "DQN": (dqn_avg, dqn_std),
    "Improved_DQN": (improved_dqn_avg, improved_dqn_std),
    "DDQN": (ddqn_avg, ddqn_std),
    "SAC": (sac_avg, sac_std),
}
models = list(model_stats)
for i in range(len(models)):
    for j in range(i + 1, len(models)):
        model1, model2 = models[i], models[j]
        p_value = two_sample_t_test(*model_stats[model1], *model_stats[model2], n1, n2)
        print(f"Two-sample t-test between {model1} and {model2}: p-value = {p_value}")
Two-sample t-test between DQN and Improved_DQN: p-value = 1.8851302977853567e-45 Two-sample t-test between DQN and DDQN: p-value = 4.460181319774324e-55 Two-sample t-test between DQN and SAC: p-value = 1.4132384619600633e-40 Two-sample t-test between Improved_DQN and DDQN: p-value = 0.09470622155215883 Two-sample t-test between Improved_DQN and SAC: p-value = 2.3701112197964183e-152 Two-sample t-test between DDQN and SAC: p-value = 1.0529611551567295e-166
ANALYSIS OF MODEL EFFICIENCY
Now, we will be analyzing the efficiency of our models. We define efficiency as the ability to achieve more with less, which in our case would be asking "How good is a model at gaining rewards without the extensive use of steps?". We can assess this with two metrics:
Step Count. This metric is self-explanatory: we track the number of steps each model takes. A higher value indicates that a model takes longer on average to finish, and vice versa.
Efficiency Score. We calculate this metric as the total reward divided by the total number of steps, i.e. the average reward per step.
Essentially, the higher the efficiency score, the more efficient the model; the lower the score, the less efficient. Generally, we look for a model with a high efficiency score and a low step count.
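The efficiency score above (total reward over total steps, matching `avg_reward_per_step` in our MetricsCalculator) can be sketched on made-up numbers. Since Pendulum rewards are negative, a score closer to zero means less reward is lost per step:

```python
def efficiency_score(total_rewards, no_of_steps):
    """Average reward per step: sum of rewards over sum of steps."""
    return sum(total_rewards) / sum(no_of_steps)

# Toy numbers: both agents run the same number of steps,
# but agent A loses far less reward per step than agent B
print(efficiency_score([-100, -120], [200, 200]))  # -0.55
print(efficiency_score([-300, -340], [200, 200]))  # -1.6
```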
all_avg_reward_per_step = {
'DQN': DQN_metrics.avg_reward_per_step(),
'Improved DQN': ImprovedDQN_metrics.avg_reward_per_step(),
'DDQN': DDQN_metrics.avg_reward_per_step(),
'SAC': SAC_metrics.avg_reward_per_step()
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_avg_reward_per_step, 'avg_reward_per_step')
df
| Model | avg_reward_per_step |
|---|---|
| DQN | -1.819133 |
| Improved DQN | -2.990152 |
| DDQN | -3.121399 |
| SAC | -0.880894 |
Unsurprisingly, SAC had the highest efficiency score. Its ability to understand the environment quickly allowed it to achieve strong results before the 100-episode mark, mastering the task with far less training experience: the embodiment of achieving more with less.
# Sort the DataFrame by 'avg_reward_per_step' in ascending order
df = df.sort_values(by='avg_reward_per_step', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle("Efficiency Scores")
ax = fig.subplots()
sns.barplot(
data=df,
y='avg_reward_per_step',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
# ax.legend()
ax.set_ylabel('Avg reward per step')
ax.set_xlabel('Model')
plt.show()
HYPERPARAMETER TUNING OF MODELS¶
- Based on our earlier results, we will explore hyperparameter tuning for the Soft Actor-Critic model to see if we can further improve the performance of our best-performing model.
MODIFYING THE SOFT ACTOR-CRITIC MODEL
- Adding the hyperparameters to the __init__() function of the class
class SACAgentTuning:
def __init__(
self,
state_dim=3,
action_dim=1,
lr_pi=0.001,
lr_q=0.001,
gamma=0.98,
batch_size=200,
buffer_limit=100000,
tau=0.005,
init_alpha=0.01,
lr_alpha=0.005,
):
self.state_dim = state_dim
self.action_dim = action_dim
self.lr_pi = lr_pi
self.lr_q = lr_q
self.gamma = gamma
self.batch_size = batch_size
self.buffer_limit = buffer_limit
self.tau = tau
self.init_alpha = init_alpha
self.target_entropy = -self.action_dim
self.lr_alpha = lr_alpha
self.memory = ReplayBuffer(self.buffer_limit)
self.log_alpha = torch.tensor(np.log(self.init_alpha))
self.log_alpha.requires_grad = True
self.log_alpha_optimizer = optim.Adam([self.log_alpha], lr=self.lr_alpha)
self.PI = PolicyNetwork(self.state_dim, self.action_dim, self.lr_pi)
self.Q1 = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q1_target = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q2 = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q2_target = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q1_target.load_state_dict(self.Q1.state_dict())
self.Q2_target.load_state_dict(self.Q2.state_dict())
def choose_action(self, s):
with torch.no_grad():
action, log_prob = self.PI.sample(s)
return action, log_prob
def calc_target(self, mini_batch):
s, a, r, s_prime, done = mini_batch
with torch.no_grad():
a_prime, log_prob_prime = self.PI.sample(s_prime)
entropy = - self.log_alpha.exp() * log_prob_prime
q1_target, q2_target = self.Q1_target(s_prime, a_prime), self.Q2_target(s_prime, a_prime)
q_target = torch.min(q1_target, q2_target)
target = r + self.gamma * done * (q_target + entropy)
return target
def train_agent(self):
mini_batch = self.memory.sample(self.batch_size)
s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
td_target = self.calc_target(mini_batch)
# Training of Q1
q1_loss = F.smooth_l1_loss(self.Q1(s_batch, a_batch), td_target)
self.Q1.optimizer.zero_grad()
q1_loss.mean().backward()
self.Q1.optimizer.step()
# Training of Q2
q2_loss = F.smooth_l1_loss(self.Q2(s_batch, a_batch), td_target)
self.Q2.optimizer.zero_grad()
q2_loss.mean().backward()
self.Q2.optimizer.step()
# Training of PI
a, log_prob = self.PI.sample(s_batch)
entropy = -self.log_alpha.exp() * log_prob
q1, q2 = self.Q1(s_batch, a), self.Q2(s_batch, a)
q = torch.min(q1, q2)
pi_loss = -(q + entropy) # For gradient ascent
self.PI.optimizer.zero_grad()
pi_loss.mean().backward()
self.PI.optimizer.step()
# Alpha train
self.log_alpha_optimizer.zero_grad()
alpha_loss = -(self.log_alpha.exp() * (log_prob + self.target_entropy).detach()).mean()
alpha_loss.backward()
self.log_alpha_optimizer.step()
# Soft update of Q1 and Q2
for param_target, param in zip(self.Q1_target.parameters(), self.Q1.parameters()):
param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
for param_target, param in zip(self.Q2_target.parameters(), self.Q2.parameters()):
param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
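The loop at the end of train_agent is a Polyak (soft) update: rather than copying the online Q-networks into the targets all at once, each target parameter drifts toward its online counterpart by a fraction tau per update. A tiny sketch with one-parameter networks (toy values, not our trained weights) shows the blend numerically:

```python
import torch

def soft_update(target, source, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * source."""
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.data.copy_(p_t.data * (1.0 - tau) + p.data * tau)

# One-weight "networks" so the blend is easy to read off
target = torch.nn.Linear(1, 1, bias=False)
source = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    target.weight.fill_(0.0)
    source.weight.fill_(1.0)

soft_update(target, source, tau=0.005)
print(target.weight.item())  # ~0.005: the target moves slowly toward the source
```

Keeping the targets slow-moving stabilizes the TD targets computed in calc_target, at the cost of the targets lagging behind the online networks.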
HYPERPARAMETER TUNING FUNCTION
def hp_tune_SACAgent(config):
# Initialize the SAC agent for tuning and the related variables required
hp_agent = SACAgentTuning(**config)
env = gym.make("Pendulum-v1", g=9.81)
episodes = 800
total_rewards = []
no_of_steps = []
success_count = 0
best_reward = float('-inf')
# Get hypertuning checkpoint
if train.get_checkpoint():
loaded_checkpoint = train.get_checkpoint()
with loaded_checkpoint.as_directory() as loaded_checkpoint_dir:
model_state = torch.load(
os.path.join(loaded_checkpoint_dir, "checkpoint.pt")
)
hp_agent.load_state_dict(model_state)
# Loop through the range of episodes
for episode in range(episodes):
state = env.reset()
score, done = 0.0, False
counter = 0
while not done:
counter += 1
action, log_prob = hp_agent.choose_action(torch.FloatTensor(state))
state_prime, reward, done, _ = env.step([action])
hp_agent.memory.put((state, action, reward, state_prime, done))
score += reward
state = state_prime
if hp_agent.memory.size() > 1000:
hp_agent.train_agent()
# Recording results
if len(total_rewards) > 0:
success_count += (score - total_rewards[-1]) >= 200
total_rewards.append(score)
no_of_steps.append(counter)
if score > best_reward:
best_reward = score
# Saving Checkpoint
metrics = {
"avg_reward": np.mean(total_rewards),
}
with tempfile.TemporaryDirectory() as tempdir:
torch.save(
hp_agent.PI.state_dict(),
os.path.join(tempdir, "checkpoint.pt"),
)
train.report(metrics=metrics, checkpoint=Checkpoint.from_directory(tempdir))
env.close()
RUNNING HYPERPARAMETER TUNING
- The search space defined is a reasonable range of values within which the best results should occur.
- We are using ASHAScheduler, which is an alias for AsyncHyperBandScheduler. It is a scheduler for hyperparameter optimization in distributed machine learning and neural architecture search (NAS). It efficiently manages multiple trials with different hyperparameter configurations, applies early stopping, and is designed for parallel and asynchronous execution, making it useful for finding optimal hyperparameters while utilizing multiple computing resources.
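The core idea underlying ASHA is successive halving: give every configuration a small budget, keep the best fraction, and repeat with more budget. ASHA itself runs this asynchronously across brackets, but a minimal synchronous sketch (a toy objective, not our SAC training) conveys the mechanism:

```python
def successive_halving(configs, evaluate, reduction_factor=2, max_rounds=3):
    """Minimal synchronous sketch of successive halving: evaluate all
    configs at a small budget, keep the top 1/reduction_factor, repeat
    with a larger budget."""
    budget = 1
    for _ in range(max_rounds):
        scored = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        keep = max(1, len(scored) // reduction_factor)
        configs = scored[:keep]
        budget *= reduction_factor
        if len(configs) == 1:
            break
    return configs[0]

# Toy objective: configurations closer to 0.01 score higher, at any budget
candidates = [0.1, 0.05, 0.02, 0.011, 0.001, 0.0001, 0.03, 0.09]
best = successive_halving(candidates, lambda c, b: -abs(c - 0.01))
print(best)  # 0.011 survives every halving round
```

Early stopping of the weakest configurations is what lets the scheduler explore ten trials without paying the full 800-episode cost for each.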
search_space = {
"state_dim": 3, # Fixed for the environment
"action_dim": 1, # Fixed for the environment
"lr_pi": tune.loguniform(1e-4, 0.1), # Loguniform search for lr_pi
"lr_q": tune.loguniform(1e-4, 0.1), # Loguniform search for lr_q
"gamma": tune.choice([0.95, 0.98, 0.99]), # choices for gamma
"batch_size": tune.choice([100, 200, 300]), # choices for batch_size
"buffer_limit": tune.choice([50000, 100000, 200000]), # Choices for buffer_limit
"tau": tune.uniform(0.001, 0.01), # Uniform search for tau
"init_alpha": tune.loguniform(1e-4, 0.1), # Loguniform search for init_alpha
"lr_alpha": tune.loguniform(1e-4, 0.1), # Loguniform search for lr_alpha
}
scheduler = ASHAScheduler(
max_t=800,
grace_period=1,
reduction_factor=2
)
tuner = tune.Tuner(
tune.with_resources(
tune.with_parameters(hp_tune_SACAgent),
resources={"cpu": 2}
),
tune_config=tune.TuneConfig(
metric="avg_reward",
mode="max",
scheduler=scheduler,
num_samples=10,
),
param_space=search_space,
)
results = tuner.fit()
best_trial = results.get_best_result("avg_reward", "max")
print(f"Best trial config: {best_trial.config}")
print(f"Best trial final average reward: {best_trial.metrics['avg_reward']}")
Tune Status
| Current time: | 2024-01-28 08:56:43 |
| Running for: | 00:00:00.19 |
| Memory: | 12.7/15.2 GiB |
System Info
Using AsyncHyperBand: num_stopped=0. Bracket: Iter 512.000: None | Iter 256.000: None | Iter 128.000: None | Iter 64.000: None | Iter 32.000: None | Iter 16.000: None | Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:G)
Trial Status
| Trial name | status | loc | batch_size | buffer_limit | gamma | init_alpha | lr_alpha | lr_pi | lr_q | tau |
|---|---|---|---|---|---|---|---|---|---|---|
| hp_tune_SACAgent_1a2e2_00000 | PENDING | | 200 | 100000 | 0.98 | 0.0245411 | 0.021885 | 0.000551092 | 0.0178988 | 0.00284066 |
| hp_tune_SACAgent_1a2e2_00001 | PENDING | | 300 | 50000 | 0.95 | 0.0091431 | 0.0136917 | 0.0040095 | 0.00476492 | 0.00873231 |
| hp_tune_SACAgent_1a2e2_00002 | PENDING | | 200 | 200000 | 0.99 | 0.0347088 | 0.0153973 | 0.000388744 | 0.00658863 | 0.00809449 |
| hp_tune_SACAgent_1a2e2_00003 | PENDING | | 100 | 50000 | 0.95 | 0.000760478 | 0.000133917 | 0.0325391 | 0.00393464 | 0.00478841 |
| hp_tune_SACAgent_1a2e2_00004 | PENDING | | 200 | 50000 | 0.99 | 0.00702418 | 0.000293844 | 0.0819245 | 0.000352949 | 0.0087974 |
| hp_tune_SACAgent_1a2e2_00005 | PENDING | | 100 | 200000 | 0.95 | 0.00360496 | 0.035589 | 0.00308427 | 0.0183588 | 0.00899611 |
| hp_tune_SACAgent_1a2e2_00006 | PENDING | | 300 | 50000 | 0.99 | 0.00146163 | 0.00187331 | 0.000450539 | 0.00013053 | 0.00887594 |
| hp_tune_SACAgent_1a2e2_00007 | PENDING | | 100 | 50000 | 0.99 | 0.000231165 | 0.000205124 | 0.000133754 | 0.0013325 | 0.00743987 |
| hp_tune_SACAgent_1a2e2_00008 | PENDING | | 300 | 200000 | 0.98 | 0.0576165 | 0.00344381 | 0.0255 | 0.00927923 | 0.00424883 |
| hp_tune_SACAgent_1a2e2_00009 | PENDING | | 300 | 100000 | 0.95 | 0.00665678 | 0.00979623 | 0.054478 | 0.0014454 | 0.00730881 |
(hp_tune_SACAgent pid=22804) c:\Users\zzhen\anaconda3\envs\gpu_env\lib\site-packages\gym\core.py:317: DeprecationWarning: WARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future. (hp_tune_SACAgent pid=22804) deprecation( (hp_tune_SACAgent pid=22804) c:\Users\zzhen\anaconda3\envs\gpu_env\lib\site-packages\gym\wrappers\step_api_compatibility.py:39: DeprecationWarning: WARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future. (hp_tune_SACAgent pid=22804) deprecation( (hp_tune_SACAgent pid=22804) c:\Users\zzhen\anaconda3\envs\gpu_env\lib\site-packages\numpy\core\fromnumeric.py:43: FutureWarning: The input object of type 'Tensor' is an array-like implementing one of the corresponding protocols (`__array__`, `__array_interface__` or `__array_struct__`); but not a sequence (or 0-D). In the future, this object will be coerced as if it was first converted using `np.array(obj)`. To retain the old behaviour, you have to either modify the type 'Tensor', or assign to an empty array created with `np.empty(correct_shape, dtype=object)`. (hp_tune_SACAgent pid=22804) result = getattr(asarray(obj), method)(*args, **kwds) (hp_tune_SACAgent pid=22804) c:\Users\zzhen\anaconda3\envs\gpu_env\lib\site-packages\gym\utils\passive_env_checker.py:241: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`. 
(Deprecated NumPy 1.24) (hp_tune_SACAgent pid=22804) if not isinstance(terminated, (bool, np.bool8)): (hp_tune_SACAgent pid=22804) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00000_0_batch_size=200,buffer_limit=100000,gamma=0.9800,init_alpha=0.0245,lr_alpha=0.0219,lr_pi=0.0006,lr_q_2024-01-28_08-56-42/checkpoint_000000)
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000110) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000114) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000118) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000122) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000126) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000130) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000134) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000138) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000142) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000146) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000150) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000154) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000158) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000161) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000164) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000167) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000170) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000173) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000176) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000180) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000184) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000188) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000192) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000196) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000199) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000203) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000207) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000211) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000215) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000219) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000223) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000227) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000231) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000235) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000239) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000243) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000247) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000251) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000255) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000259) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000263) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000267) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000271) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000275) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000279) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000283) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000287) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000291) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000295) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000299) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000303) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000307) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000310) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000313) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000316) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000320) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000324) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000328) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000332) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000336) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000340) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000344) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000348) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000351) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000355) [repeated 4x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000358) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000360) [repeated 2x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000363) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000366) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000369) [repeated 3x across cluster] (hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, 
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000372)
[... identical checkpoint messages for checkpoint_000375 through checkpoint_000796 omitted ...]
2024-01-28 09:17:19,506 INFO tune.py:1042 -- Total run time: 1236.72 seconds (1236.64 seconds for the tuning loop).
Best trial config: {'state_dim': 3, 'action_dim': 1, 'lr_pi': 0.0003887437422389239, 'lr_q': 0.006588627430399412, 'gamma': 0.99, 'batch_size': 200, 'buffer_limit': 200000, 'tau': 0.008094487127446998, 'init_alpha': 0.03470881719479883, 'lr_alpha': 0.015397298925206759}
Best trial final total reward: -169.30299748599347
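For reference, the best trial above is simply the one with the highest final total reward across all tuning trials. A minimal sketch of that selection logic (the `trials` list and `pick_best_trial` helper here are hypothetical stand-ins, not Ray Tune's actual API):

```python
def pick_best_trial(trials):
    """Return the trial record whose final total reward is highest.

    `trials` is a hypothetical list of {'config': ..., 'reward': ...}
    records, standing in for Ray Tune's result grid.
    """
    return max(trials, key=lambda t: t['reward'])

trials = [
    {'config': {'lr_pi': 1e-4}, 'reward': -512.3},
    {'config': {'lr_pi': 4e-4}, 'reward': -169.3},
    {'config': {'lr_pi': 1e-3}, 'reward': -803.9},
]

best = pick_best_trial(trials)
print(best['config'])  # the trial with the least negative reward
```

Because Pendulum rewards are always negative, "highest" means closest to zero.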
BEST MODEL AFTER TUNING
best_hp_agent = SACAgentTuning(**best_trial.config)
EVALUATING HYPERTUNED MODEL
- Retrain the SAC model with the best configuration
def train_best_SACAgent(best_hp_agent: SACAgentTuning):
    # Initialize the SAC Agent and related variables required
    agent = best_hp_agent
    env = gym.make('Pendulum-v1', g=9.81)
    episodes = 800
    total_rewards = []
    no_of_steps = []
    success_count = 0
    frames = []
    best_episode = 0
    best_reward = float('-inf')

    # Loop through the range of episodes
    for episode in range(episodes):
        state = env.reset()
        score, done = 0.0, False
        start_time = datetime.datetime.now()
        counter = 0
        while not done:
            counter += 1
            action, log_prob = agent.choose_action(torch.FloatTensor(state))
            state_prime, reward, done, _ = env.step([action])
            agent.memory.put((state, action, reward, state_prime, done))
            score += reward
            state = state_prime
            if counter % 50 == 0 and score > -50:
                screen = env.render(mode='rgb_array')
                frames.append(screen)
            if agent.memory.size() > 1000:
                agent.train_agent()

        # Recording results
        if len(total_rewards) > 0:
            success_count += (score - total_rewards[-1]) >= 200 or score > -2
        total_rewards.append(score)
        no_of_steps.append(counter)
        if score > best_reward:
            best_reward = score
            best_episode = episode

        # Saving the models
        save_folder = "Tuned_SAC"
        if not os.path.exists(save_folder):
            os.makedirs(save_folder)
        if episode == best_episode:
            model_name = os.path.join(save_folder, "Tuned_SAC" + str(episode) + ".pt")
            torch.save(agent.PI.state_dict(), model_name)
        if episode % 10 == 0:
            elapsed_time = datetime.datetime.now() - start_time
            print('Episode {:>4} | Total Reward: {:>8.2f} | Elapsed: {}'.format(episode, score, elapsed_time))

    env.close()
    return {
        'total_rewards': total_rewards,
        'no_of_steps': no_of_steps,
        'success_count': success_count,
        'frames': frames
    }
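The `success_count` update inside the loop counts an episode as a "success" when its score improves by at least 200 over the previous episode, or when the score is above -2 (near-optimal for Pendulum). Isolated as a standalone helper for clarity (the function name is our own):

```python
def is_success(score, prev_score):
    """Mirror the notebook's success criterion: a jump of at least 200
    over the previous episode, or a near-optimal score above -2."""
    return (score - prev_score) >= 200 or score > -2

# Toy reward trajectory; compare each episode with its predecessor
rewards = [-887.6, -665.2, -131.1, -1.8, -120.4]
successes = sum(
    is_success(r, prev) for prev, r in zip(rewards, rewards[1:])
)
print(successes)  # → 3
```

Note that this is a relative criterion: a mediocre episode still counts as a success if it follows a much worse one, which is why the metric reads more like an "improvement rate" than a strict solve rate.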
tuned_SAC_results = train_best_SACAgent(best_hp_agent)
Episode 0 | Total Reward: -887.64 | Elapsed: 0:00:00.055890 Episode 10 | Total Reward: -665.24 | Elapsed: 0:00:02.148514 Episode 20 | Total Reward: -131.11 | Elapsed: 0:00:01.387094 Episode 30 | Total Reward: -370.03 | Elapsed: 0:00:01.326462 Episode 40 | Total Reward: -234.36 | Elapsed: 0:00:01.342169 Episode 50 | Total Reward: -120.40 | Elapsed: 0:00:01.368868 Episode 60 | Total Reward: -1.85 | Elapsed: 0:00:01.395959 Episode 70 | Total Reward: -122.71 | Elapsed: 0:00:01.352037 Episode 80 | Total Reward: -126.02 | Elapsed: 0:00:01.462505 Episode 90 | Total Reward: -121.53 | Elapsed: 0:00:01.382721 Episode 100 | Total Reward: -237.28 | Elapsed: 0:00:01.378617 Episode 110 | Total Reward: -356.86 | Elapsed: 0:00:01.403869 Episode 120 | Total Reward: -126.44 | Elapsed: 0:00:01.356247 Episode 130 | Total Reward: -130.07 | Elapsed: 0:00:01.407759 Episode 140 | Total Reward: -245.72 | Elapsed: 0:00:01.416777 Episode 150 | Total Reward: -229.19 | Elapsed: 0:00:01.447425 Episode 160 | Total Reward: -117.94 | Elapsed: 0:00:01.322713 Episode 170 | Total Reward: -328.76 | Elapsed: 0:00:01.434287 Episode 180 | Total Reward: -248.94 | Elapsed: 0:00:01.437368 Episode 190 | Total Reward: -232.39 | Elapsed: 0:00:01.411170 Episode 200 | Total Reward: -224.91 | Elapsed: 0:00:01.512126 Episode 210 | Total Reward: -236.58 | Elapsed: 0:00:01.390656 Episode 220 | Total Reward: -127.59 | Elapsed: 0:00:01.421051 Episode 230 | Total Reward: -1.18 | Elapsed: 0:00:01.456356 Episode 240 | Total Reward: -1.49 | Elapsed: 0:00:01.440719 Episode 250 | Total Reward: -122.12 | Elapsed: 0:00:01.414578 Episode 260 | Total Reward: -125.92 | Elapsed: 0:00:01.506192 Episode 270 | Total Reward: -129.42 | Elapsed: 0:00:01.422304 Episode 280 | Total Reward: -11.53 | Elapsed: 0:00:01.502813 Episode 290 | Total Reward: -135.59 | Elapsed: 0:00:01.443406 Episode 300 | Total Reward: -121.21 | Elapsed: 0:00:01.478148 Episode 310 | Total Reward: -122.59 | Elapsed: 0:00:01.431704 Episode 320 | Total Reward: 
-231.50 | Elapsed: 0:00:01.484078 Episode 330 | Total Reward: -119.16 | Elapsed: 0:00:01.394267 Episode 340 | Total Reward: -125.35 | Elapsed: 0:00:01.460871 Episode 350 | Total Reward: -124.25 | Elapsed: 0:00:01.465501 Episode 360 | Total Reward: -117.02 | Elapsed: 0:00:01.420217 Episode 370 | Total Reward: -232.21 | Elapsed: 0:00:01.431042 Episode 380 | Total Reward: -237.55 | Elapsed: 0:00:01.431311 Episode 390 | Total Reward: -344.62 | Elapsed: 0:00:01.429877 Episode 400 | Total Reward: -129.17 | Elapsed: 0:00:01.467225 Episode 410 | Total Reward: -235.88 | Elapsed: 0:00:01.484058 Episode 420 | Total Reward: -128.39 | Elapsed: 0:00:01.464355 Episode 430 | Total Reward: -114.32 | Elapsed: 0:00:01.441448 Episode 440 | Total Reward: -117.72 | Elapsed: 0:00:01.445694 Episode 450 | Total Reward: -227.68 | Elapsed: 0:00:01.454484 Episode 460 | Total Reward: -125.78 | Elapsed: 0:00:01.412873 Episode 470 | Total Reward: -246.13 | Elapsed: 0:00:01.456214 Episode 480 | Total Reward: -121.11 | Elapsed: 0:00:01.464099 Episode 490 | Total Reward: -122.82 | Elapsed: 0:00:01.452174 Episode 500 | Total Reward: -246.19 | Elapsed: 0:00:01.527423 Episode 510 | Total Reward: -122.98 | Elapsed: 0:00:01.512263 Episode 520 | Total Reward: -225.45 | Elapsed: 0:00:01.514035 Episode 530 | Total Reward: -119.67 | Elapsed: 0:00:01.442444 Episode 540 | Total Reward: -127.23 | Elapsed: 0:00:01.546772 Episode 550 | Total Reward: -119.20 | Elapsed: 0:00:01.459459 Episode 560 | Total Reward: -126.79 | Elapsed: 0:00:01.508984 Episode 570 | Total Reward: -224.91 | Elapsed: 0:00:01.478620 Episode 580 | Total Reward: -123.65 | Elapsed: 0:00:01.483825 Episode 590 | Total Reward: -123.34 | Elapsed: 0:00:01.468968 Episode 600 | Total Reward: -128.07 | Elapsed: 0:00:01.514391 Episode 610 | Total Reward: -0.92 | Elapsed: 0:00:01.528736 Episode 620 | Total Reward: -121.43 | Elapsed: 0:00:01.467581 Episode 630 | Total Reward: -120.26 | Elapsed: 0:00:01.576493 Episode 640 | Total Reward: -239.47 | 
Elapsed: 0:00:01.495190 Episode 650 | Total Reward: -340.64 | Elapsed: 0:00:01.584836 Episode 660 | Total Reward: -127.64 | Elapsed: 0:00:01.542311 Episode 670 | Total Reward: -238.21 | Elapsed: 0:00:01.695179 Episode 680 | Total Reward: -223.48 | Elapsed: 0:00:01.623173 Episode 690 | Total Reward: -1.11 | Elapsed: 0:00:01.575105 Episode 700 | Total Reward: -4.74 | Elapsed: 0:00:01.554986 Episode 710 | Total Reward: -128.50 | Elapsed: 0:00:01.606051 Episode 720 | Total Reward: -120.96 | Elapsed: 0:00:01.527371 Episode 730 | Total Reward: -126.23 | Elapsed: 0:00:01.557759 Episode 740 | Total Reward: -127.12 | Elapsed: 0:00:01.543473 Episode 750 | Total Reward: -341.28 | Elapsed: 0:00:01.555745 Episode 760 | Total Reward: -127.68 | Elapsed: 0:00:01.679928 Episode 770 | Total Reward: -225.67 | Elapsed: 0:00:01.607428 Episode 780 | Total Reward: -120.50 | Elapsed: 0:00:01.576473 Episode 790 | Total Reward: -117.61 | Elapsed: 0:00:01.671018
# Calculating statistical measures
average_reward = np.mean(tuned_SAC_results['total_rewards'])
median_reward = np.median(tuned_SAC_results['total_rewards'])
max_reward = np.max(tuned_SAC_results['total_rewards'])
min_reward = np.min(tuned_SAC_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(tuned_SAC_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the SAC Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(tuned_SAC_results['total_rewards'], average_reward, model_name="Tuned SAC")
Performance Statistics for the SAC Model: -------------------------------------------- Best Episode : 789 Average Reward : -168.98 Median Reward : -125.16 Maximum Reward : -0.24 Minimum Reward : -1604.26
TESTING OUR MODEL WEIGHTS
- There is no training involved.
- The goal is to verify that the saved model weights alone can keep the pendulum inverted.
config = {
"state_dim": 3,
"action_dim": 1,
"lr_pi": 0.0003887437422389239,
"lr_q": 0.006588627430399412,
"gamma": 0.99,
"batch_size": 200,
"buffer_limit": 200000,
"tau": 0.008094487127446998,
"init_alpha": 0.03470881719479883,
"lr_alpha": 0.015397298925206759,
}
agent = SACAgentTuning(**config)
agent.PI.load_state_dict(torch.load('./Tuned_SAC/Tuned_SAC789.pt'))
test_agent(agent, 'SAC')
MODEL TRAINING EVOLUTION
- Visualize how the model has improved over each episode
# Visualizing the pendulum's animation
create_animation(tuned_SAC_results['frames'])
HYPERTUNED MODEL EVALUATION¶
We evaluate the hyperparameter-tuned SAC model against the rest of the models using the same metrics as before, to determine whether the tuned model indeed performs better.
AVERAGE REWARD BAR PLOT¶
tuned_SAC_metrics = MetricsCalculator(**tuned_SAC_results, n_episodes=800)
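`MetricsCalculator` is a project helper defined earlier in the notebook; for reference, the three metrics it exposes reduce to simple aggregates over the training record. A minimal self-contained sketch (the interface is assumed from how it is called, not the exact implementation):

```python
class MetricsSketch:
    """Approximation of the notebook's MetricsCalculator interface."""

    def __init__(self, total_rewards, no_of_steps, success_count,
                 n_episodes, **_):
        # **_ swallows extra keys (e.g. 'frames') when unpacking results
        self.total_rewards = total_rewards
        self.no_of_steps = no_of_steps
        self.success_count = success_count
        self.n_episodes = n_episodes

    def avg_reward_per_episode(self):
        return sum(self.total_rewards) / self.n_episodes

    def success_rate(self):
        # Fraction of episodes that met the success criterion during training
        return self.success_count / self.n_episodes

    def avg_reward_per_step(self):
        # Efficiency: total reward earned divided by total steps taken
        return sum(self.total_rewards) / sum(self.no_of_steps)

m = MetricsSketch([-800.0, -200.0, -100.0, -100.0], [200, 200, 200, 200],
                  success_count=2, n_episodes=4)
print(m.avg_reward_per_episode())  # -300.0
print(m.success_rate())            # 0.5
print(m.avg_reward_per_step())     # -1.5
```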
all_avg_reward_per_episode = {
'DQN': DQN_metrics.avg_reward_per_episode(),
'Improved DQN': ImprovedDQN_metrics.avg_reward_per_episode(),
'DDQN': DDQN_metrics.avg_reward_per_episode(),
'SAC': SAC_metrics.avg_reward_per_episode(),
'Tuned SAC': tuned_SAC_metrics.avg_reward_per_episode(),
}
df = create_dataframe_from_dict(all_avg_reward_per_episode, 'Avg_Reward_Per_Episode')
df
| Model | Avg_Reward_Per_Episode |
|---|---|
| DQN | -340.944229 |
| Improved DQN | -545.086096 |
| DDQN | -569.351038 |
| SAC | -176.178789 |
| Tuned SAC | -168.979288 |
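`create_dataframe_from_dict` is another small project helper; a plausible pandas-based sketch of what it does (the actual implementation may differ):

```python
import pandas as pd

def create_dataframe_from_dict(d, column_name):
    # One row per model, with the metric in a single named column
    return pd.DataFrame.from_dict(d, orient='index', columns=[column_name])

scores = {'DQN': -340.94, 'SAC': -176.18, 'Tuned SAC': -168.98}
df = create_dataframe_from_dict(scores, 'Avg_Reward_Per_Episode')
print(df.sort_values('Avg_Reward_Per_Episode', ascending=True))
```

Keeping the model names as the index is what lets the plotting cells below use `df.index` directly as the x-axis.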
# Sort the DataFrame by 'Avg_Reward_Per_Episode' in ascending order
df = df.sort_values(by='Avg_Reward_Per_Episode', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle("Average Reward")
ax = fig.subplots()
sns.barplot(
data=df,
y='Avg_Reward_Per_Episode',
x=df.index,
ax=ax,
palette=sns.color_palette('Set2')
)
# ax.legend()
ax.set_ylabel('Avg Reward Per Episode')
ax.set_xlabel('Model')
plt.show()
SUCCESS RATE¶
all_success_rate = {
'DQN': DQN_metrics.success_rate(),
'Improved DQN': ImprovedDQN_metrics.success_rate(),
'DDQN': DDQN_metrics.success_rate(),
'SAC': SAC_metrics.success_rate(),
'Tuned SAC': tuned_SAC_metrics.success_rate(),
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_success_rate, 'success_rate')
df
| Model | success_rate |
|---|---|
| DQN | 0.2450 |
| Improved DQN | 0.1525 |
| DDQN | 0.1575 |
| SAC | 0.0725 |
| Tuned SAC | 0.1100 |
# Sort the DataFrame by 'success_rate' in ascending order
df = df.sort_values(by='success_rate', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle("Success Rate")
ax = fig.subplots()
sns.barplot(
data=df,
y='success_rate',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
ax.set_ylabel('Success rate')
ax.set_xlabel('Model')
plt.show()
MODEL EFFICIENCY¶
all_avg_reward_per_step = {
'DQN': DQN_metrics.avg_reward_per_step(),
'Improved DQN': ImprovedDQN_metrics.avg_reward_per_step(),
'DDQN': DDQN_metrics.avg_reward_per_step(),
'SAC': SAC_metrics.avg_reward_per_step(),
'Tuned SAC': tuned_SAC_metrics.avg_reward_per_step(),
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_avg_reward_per_step, 'avg_reward_per_step')
# Sort the DataFrame by 'avg_reward_per_step' in ascending order
df = df.sort_values(by='avg_reward_per_step', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle("Efficiency Scores")
ax = fig.subplots()
sns.barplot(
data=df,
y='avg_reward_per_step',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
# ax.legend()
ax.set_ylabel('Avg reward per step')
ax.set_xlabel('Model')
plt.show()
CONCLUSION OF PENDULUM REINFORCEMENT LEARNING¶
Reinforcement learning is a powerful and promising field within Artificial Intelligence. With its ability to learn through trial and error and make decisions in dynamic environments, it has been successfully applied to problems ranging from gaming to robotics. The inverted pendulum is a classic example of how reinforcement learning can be used to solve control problems in a simulated environment.
We have successfully tackled the Pendulum problem through the use of Reinforcement Learning algorithms, namely DQN, DDQN and SAC. Through this project, we have evaluated the models on various aspects such as performance, efficiency, robustness and feature importance. Our findings have provided valuable insights into the behavior of the algorithms and the intricacies of Reinforcement Learning.
This project has been a challenging yet enlightening experience and has helped us to gain a deeper understanding of Reinforcement Learning concepts. We hope that our work can contribute to the development of more advanced Reinforcement Learning models in the future.